Data Mining and Warehousing
Elective I
(Code : 410244(D))
(Savitribai Phule Pune University)
Copyright © by Authors. All rights reserved. No part of this publication may be reproduced, copied, or stored in a retrieval
system, distributed or transmitted in any form or by any means, including photocopy, recording, or other electronic or
mechanical methods, without the prior written permission of the publisher.
This book is sold subject to the condition that it shall not, by the way of trade or otherwise, be lent, resold, hired out, or
otherwise circulated without the publisher’s prior written consent in any form of binding or cover other than which it is
published and without a similar condition including this condition being imposed on the subsequent purchaser and without
limiting the rights under copyright reserved above.
First Edition : July 2018
Second Revised Edition : July 2019 (TechKnowledge Publications)
This edition is for sale in India, Bangladesh, Bhutan, Maldives, Nepal, Pakistan, Sri Lanka and designated countries in South-East Asia. Sale and purchase of this book outside of these countries is unauthorized by the publisher.
ISBN 978-93-89299-36-6
Published by
TechKnowledge Publications
Dear Students,
We are extremely happy to present this book on "Data Mining and Warehousing" to you. We have divided the subject into small chapters so that the topics can be arranged and understood properly. The topics within the chapters have been arranged in a proper sequence to ensure a smooth flow of the subject.
We present this book in the loving memory of Late Shri. Pradeepji Lunawat, our source of inspiration.
We are thankful to Shri. J. S. Katre, Shri. Shital Bhandari, Shri. Arunoday Kumar and Shri. Chandroday Kumar for the encouragement and support that they have extended. We are also thankful to Sema Lunavat for the eBooks and to the staff members of TechKnowledge Publications.
We have jointly made every possible effort to eliminate all the errors in this book. However, if you find any, please let us know, because that will help us to improve further.
Pre-requisite Courses : 310242 - Database Management Systems, 310244 - Information Systems and Engineering Economics
Companion Course : 410247 - Laboratory Practice II
Course Objectives
Course Outcomes
Course Contents
(Refer chapter 4)
Unit V : Classification (08 Hours)
Introduction to : Classification and Regression for Predictive Analysis, Decision Tree Induction, Rule-Based Classification : using IF-THEN Rules for Classification, Rule Induction Using a Sequential Covering Algorithm, Bayesian Belief Networks, Training Bayesian Belief Networks, Classification Using Frequent Patterns, Associative Classification, Lazy Learners - k-Nearest-Neighbor Classifiers, Case-Based Reasoning. (Refer chapter 5)
Unit VI : … learning and multi-perspective learning. Metrics for Evaluating Classifier Performance : Accuracy, Error Rate, Precision, Recall, Sensitivity, Specificity; Evaluating the Accuracy of a Classifier : Holdout Method, Random Subsampling and Cross-Validation.
Table of Contents

Chapter 1 : Introduction 1-1 to 1-32
1.1 Data Mining ... 1-1
1.1.1 Applications of Data Mining ... 1-1
1.1.2 Challenges to Data Mining ... 1-2
1.6.2 Introduction to Data Integration ... 1-19
1.6.2(A) Entity Identification Problem ... 1-20
1.6.2(B) Redundancy and Correlation Analysis ... 1-20
1.6.3 Data Transformation and Data Discretization ... 1-21
1.6.3(A) Data Transformation ... 1-21
1.6.3(B) Data Discretization ... 1-22
1.6.3(C) Data Transformation by Normalization ... 1-22
1.6.4(A) Need for Data Reduction ... 1-25

UNIT II
Chapter 2 : Data Warehouse 2-1 to 2-43
2.2.2 OLAP Vs OLTP ... 2-4
2.3 A Multidimensional Data Model ... 2-5
2.3.1 What is Dimensional Modelling ? ... 2-5
2.3.2 Data Cubes ... 2-6
2.3.3 Star Schema ... 2-6
2.3.4 The Snowflake Schema ... 2-8
2.3.5 Star Flake Schema ... 2-8
2.3.6 Differentiate between Star Schema and Snowflake Schema ... 2-8
2.3.7 Factless Fact Table ... 2-8
2.3.8 Fact Constellation Schema or Families of Star ... 2-10
2.3.9 Examples on Star Schema and Snowflake Schema ... 2-10
2.4 OLAP Operations in the Multidimensional Data Model ... 2-24
2.5 Concept Hierarchies ... 2-28
2.6 Data Warehouse Architecture ... 2-29
2.7 The Process of Data Warehouse Design ... 2-31
2.8 Data Warehousing Design Strategies or Approaches for Building a Data Warehouse ... 2-32
2.8.1 The Top Down Approach : The Dependent Data Mart Structure ... 2-32
2.8.2 The Bottom-Up Approach : The Data Warehouse Bus Structure ... 2-33
2.8.5 A Practical Approach ... 2-36
2.10 Types of OLAP Servers : ROLAP versus MOLAP versus HOLAP ... 2-38
2.10.1 MOLAP ... 2-38
2.10.2 ROLAP ... 2-39
2.10.3 HOLAP ... 2-40
2.10.4 DOLAP ... 2-40
2.11 Examples of OLAP ... 2-40

UNIT III
Chapter 3
Syllabus : Measuring Data Similarity and Dissimilarity, Proximity Measures for Nominal Attributes and Binary Attributes, interval scaled; Dissimilarity of Numeric Data : Minkowski Distance, Euclidean distance and Manhattan distance; Proximity Measures for Categorical, Ordinal Attributes, Ratio scaled variables; Dissimilarity for Attributes of Mixed Types, Cosine Similarity.
3.1 Measuring Data Similarity and Dissimilarity ... 3-1
3.1.1 Data Matrix versus Dissimilarity Matrix ... 3-1
3.2 Proximity Measures for Nominal Attributes and Binary Attributes, Interval Scaled ... 3-2
3.2.1 Proximity Measures for Nominal Attributes ... 3-2
3.2.2 Proximity Measures for Binary Attributes ... 3-3
3.2.3 Interval Scaled ... 3-6
3.3 Dissimilarity of Numeric Data : Minkowski Distance, Euclidean Distance and Manhattan Distance ... 3-7
3.4 Proximity Measures for Categorical, Ordinal Attributes, Ratio Scaled Variables ... 3-9
3.4.1 Categorical Attributes ... 3-9
3.4.2 Ordinal Attributes ... 3-9
3.4.3 Ratio Scaled Attributes ... 3-10
3.4.4 Discrete Versus Continuous Attributes ... 3-11

UNIT IV
Chapter 4 : Association Rules Mining 4-1 to 4-50
Syllabus : Market Basket Analysis, Frequent item set, Closed item set, Association Rules, a-priori Algorithm, Generating Association Rules from Frequent Item sets, Improving the Efficiency of a-priori, Mining Frequent Item sets without Candidate Generation : FP Growth Algorithm; Mining Various Kinds of Association Rules : Mining multilevel association rules, constraint based association rule mining, Meta rule-Guided Mining of Association Rules.
4.1 Market Basket Analysis ... 4-1
4.3 Closed Itemsets ... 4-3
4.4 Association Rules ... 4-4
4.4.1 Finding the Large Itemsets ... 4-4
4.4.2 Frequent Pattern Mining ... 4-4
4.4.3 Efficient and Scalable Frequent Itemset Mining Method ... 4-5
4.5 A-priori Algorithm ... 4-5
4.5.1 Advantages and Disadvantages of Apriori Algorithm ... 4-6
4.6 Generating Association Rules from Frequent Item Sets ... 4-7
4.7 Improving the Efficiency of a-priori ... 4-7
4.8 Solved Example on Apriori Algorithm ... 4-8
4.9 Mining Frequent Item sets without Candidate Generation : FP Growth Algorithm ... 4-34
4.9.1 FP-Tree Algorithm ... 4-34
4.9.2 FP-Tree Size ... 4-36
4.9.3 Example of FP Tree ... 4-36
4.9.4 Mining Frequent Patterns from FP Tree ... 4-40
4.10.2 Constraint based Association Rule Mining ... 4-48
4.10.3 Metarule-Guided Mining of Association Rule ... 4-48
4.11 Solved University Question and Answer ... 4-49

UNIT V
Chapter 5 : Classification
5.1 Introduction to : Classification and Regression for Predictive Analysis ... 5-1
5.1.1 Classification is a Two Step Process ... 5-1
5.1.2 Difference between Classification and Prediction ... 5-3
5.1.3 Issues Regarding Classification and Prediction ... 5-3
5.1.4 Regression ... 5-4
5.2 Decision Tree Induction Classification Methods ... 5-6
5.2.1 Appropriate Problems for Decision Tree Learning ... 5-7
5.2.2 Decision Tree Representation ... 5-7
5.2.3 Algorithm for Inducing a Decision Tree ... 5-8
5.2.4 Tree Pruning ... 5-9
5.2.5 Examples of ID3 ... 5-9
5.3 Rule-Based Classification : using IF-THEN Rules for Classification ... 5-44
5.3.1 Rule Coverage and Accuracy ... 5-44
5.3.2 Characteristics of Rule-Based Classifier ... 5-45
5.4 Rule Induction Using a Sequential Covering Algorithm
5.7 Associative Classification ... 5-50
5.7.1 CBA ... 5-51
5.7.2 CMAR ... 5-51
5.8 Lazy Learners : (or Learning from your Neighbors) ... 5-51

UNIT VI
Chapter 6
Syllabus : Metrics for Evaluating Classifier Performance : Accuracy, Error Rate, Precision, Recall, Sensitivity, Specificity; Evaluating the Accuracy of a Classifier : Holdout Method, Random Subsampling and Cross-Validation.
6.1 Multiclass Classification ... 6-1
6.1.1 Introduction to Multiclass Classification ... 6-1
1 Introduction
Unit I
Syllabus
Data Mining, Data Mining Task Primitives, Data : Data, Information and Knowledge; Attribute Types : Nominal, Binary, Ordinal and Numeric attributes, Discrete versus Continuous Attributes; Introduction to Data Preprocessing, Data Cleaning : Missing values, Noisy data; Data Integration : Correlation analysis; Transformation : Min-max normalization, z-score normalization and decimal scaling; Data Reduction : Data Cube Aggregation, Attribute Subset Selection, Sampling; and Data Discretization : Binning, Histogram Analysis.
1.1 Data Mining
− Data Mining is a technology that helps organizations process data through algorithms to uncover meaningful patterns and correlations from large databases, which may otherwise not be found with standard analysis and reporting.
− Data mining tools can help organizations understand their business better and improve future performance through predictive analytics, making them proactive and enabling knowledge-driven decisions.
− To address issues related to information extraction from large databases, the data mining field brings together methods from several domains such as Machine Learning, Statistics, Pattern Recognition, Databases and Visualization.
− Data mining finds application in market analysis and management, e.g. customer relationship management, cross selling and market segmentation. It can also be used in risk analysis and management for forecasting, customer retention, improved underwriting, quality control, competitive analysis and credit scoring.
Definition of Data Mining
(SPPU - Dec. 15)
− Data mining is the process of analysing large amounts of data stored in a data warehouse for useful information which
makes use of artificial intelligence techniques, neural networks and advanced statistical tools (such as cluster analysis)
to reveal trends, patterns and relationships which otherwise may be undetected.
− Data Mining is a non-trivial process of identifying :
o Valid
o Novel
o Potentially useful, understandable patterns in data.
− For example, in the banking sector data mining can be used for customer retention and fraud prevention, e.g. in credit card approval and fraud detection.
− Prediction models can be developed to help analyze data collected over years. For example, customer data can be used to find out whether a customer can avail a loan from the bank, or whether an accident claim is fraudulent and needs further investigation.
− The effectiveness of a medicine or a certain procedure may be predicted in the medical domain by using data mining.
− Data mining can be used by pharmaceutical firms as a guide for research on new treatments for diseases, by analyzing chemical compounds and genetic materials.
− A large amount of data in the retail industry, like purchasing history and transportation services, may be collected for analysis. This data can help with multidimensional analysis, sales campaign effectiveness, customer retention, recommendation of products and much more.
− The telecommunication industry also uses data mining; for example, customer data may be analysed to determine which customers are likely to remain subscribers and which will shift to competitors.
1.1.2 Challenges to Data Mining
(SPPU - Oct. 16)
Q. Describe three challenges to data mining regarding data mining methodology. (Oct. 16, 6 Marks)
− Mining different kinds of knowledge in databases : Data mining should cover a wide spectrum of data analysis and knowledge discovery tasks such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis.
− Each of these tasks will use the same database in different ways and will require different data mining techniques.
− Interactive mining, with the use of OLAP operations on a data cube, allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
− The user can then interactively view the data and discover patterns at multiple granularities and from different
angles.
Q. Explain the knowledge discovery in database (KDD) with diagram. What is the role of data mining steps in KDD ?
(Aug. 17, 6 Marks)
− The process of discovering knowledge in data through the application of data mining methods is referred to as Knowledge Discovery in Databases (KDD).
− It includes a wide variety of application domains, such as Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics and Data Visualisation.
− The main goal is extracting knowledge from large databases; this goal is achieved by using various data mining algorithms to identify useful patterns according to predefined measures and thresholds.
Outline steps of the KDD process
The overall process of finding and interpreting patterns from data involves the repeated application of the following steps :
1. Developing an understanding of the application domain and the goal of the KDD process.
2. Creating a target data set : selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
4. Data reduction and projection :
(i) Based on the goal of the task, useful features are found to represent the data.
(ii) The number of variables may be effectively reduced using methods like dimensionality reduction or transformation. Invariant representations for the data may also be found.
5. Choosing the data mining task : selecting the appropriate data mining task like classification, clustering or regression based on the goal of the KDD process.
6. Choosing the data mining algorithm(s) :
(i) Pattern search is done using the appropriate data mining method(s).
(ii) A decision is taken on which models and parameters may be appropriate.
(iii) Considering the overall criteria of the KDD process, a match to the particular data mining method is made.
7. Data mining : using a representational form, such as classification rules or trees, regression or clustering, to search for patterns of interest.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
The terms knowledge discovery and data mining are distinct.
KDD : KDD is a field of computer science which helps humans extract useful, previously undiscovered knowledge from data. It makes use of tools and theories for the same.
Data Mining : Data mining is one of the steps in the KDD process; it applies the appropriate algorithm, based on the goal of the KDD process, to identify patterns from data.
Architecture of a typical data mining system may have the following major components, as shown in Fig. 1.1.2.
1. Database, data warehouse or other information repository : These are information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server : It fetches the data as per the user's requirement, as needed for the data mining task.
3. Knowledge base : This is used to guide the search and helps identify interesting hidden patterns in the data.
4. Data mining engine : It performs the data mining tasks such as characterization, association, classification, cluster analysis, etc.
5. Pattern evaluation module : It is integrated with the mining module and helps in searching only the interesting patterns.
6. User interface : This module is used to communicate between the user and the data mining system and allows users to browse database or data warehouse schemas.
Data Mining Task Primitives
(SPPU - Oct. 18)
Data mining primitives define a data mining task, which can be specified in the form of a data mining query.
3. Background knowledge
− It is the information about the domain to be mined.
− Concept hierarchies are a form of background knowledge which helps to discover knowledge at multiple levels of abstraction.
a) Schema hierarchies
It is a total or partial order among attributes in the database schema.
Example : Location hierarchy : street < city < province/state < country.
b) Set-grouping hierarchies
It organises the values of a given attribute or dimension into groups or ranges (for example, grouping age values into the sets teenage, young and old).
c) Operation-derived hierarchies
It is based on a specified operation, which may include decoding of information-encoded strings, information extraction from complex data objects, or data clustering.
Example : A URL or e-mail address. An e-mail address of the form login@dept.univ.country can be decoded to give the hierarchy login name < dept. < univ. < country.
d) Rule-based hierarchies
It occurs when either the whole or a portion of a concept hierarchy is defined as a set of rules and is evaluated dynamically based on the current database data and the rule definitions.
Example
Following rules are used to categorize items as low_profit, medium_profit and high_profit_margin.
4. Interestingness measures
− Interestingness measures and thresholds are used to separate interesting patterns from uninteresting ones; patterns not meeting the threshold are not presented to the user.
Objective measures of pattern interestingness :
1. Simplicity : A pattern's interestingness is based on its overall simplicity for human comprehension.
3. Certainty (confidence) : Assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure, computed as
Confidence (A => B) = (Number of tuples containing both A and B) / (Number of tuples containing A),
as illustrated in the short sketch after this list.
5. Novelty : Patterns contributing new information to the given pattern set are called novel patterns.
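To make the confidence measure concrete, the following minimal Python sketch counts tuples over a small made-up transaction list (the items and counts are purely illustrative, not from the text) and computes Confidence (A => B) exactly as defined above.

# Sketch : support counts and confidence for A => B over toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def confidence(antecedent, consequent, transactions):
    # Confidence(A => B) = count(A and B) / count(A)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return count_ab / count_a if count_a else 0.0

print(confidence({"bread"}, {"milk"}, transactions))   # 3 / 4 = 0.75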
5. Presentation and visualization of discovered patterns
− Data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.
− The user must be able to specify the forms of presentation to be used for displaying the discovered patterns.
1.3 Data, Information and Knowledge
− Data represents single primary entities and the related transactions of those entities. Data are facts which have not yet been processed or analyzed. Example : "The price of petrol is Rs. 80 per litre".
− Information is obtained after the data has been processed, interpreted and analysed. Such information is meaningful and useful to the user. Example : "The price of petrol has increased from Rs. 80 to Rs. 85 in the last 3 months". This information is useful to a user who keeps track of petrol prices.
− Knowledge is useful for taking decisions and actions for the business. Information is transformed into knowledge. Example : "When the petrol price increases, it is likely that transportation cost also increases".
− The boundaries between data, information and knowledge are not sharp; what is data for one user may be information for another. Ultimately, knowledge helps to take action for the business and delivers the metrics or value that decision makers need to take decisions.
Fig. 1.3.1 : (b) Knowledge leads to action and BI delivers value for decision makers
1.4 Attribute Types
Data Objects
− A data object is a logical cluster of all tables in the data set that contain data related to the same entity; it represents an object view of that entity.
− Example : In a product manufacturing company, product, customer are objects. In a retail store, employee, customer,
items and sales are objects.
− Every data object is described by its properties, called attributes, and is stored in the database in the form of a row or tuple. The columns of this tuple are the attributes.
Attribute types
− An attribute is a property or characteristic of a data object. For e.g. Gender is a characteristic of a data object person.
− Nominal attributes are also called Categorical attributes and allow only qualitative classification.
− Every individual item belongs to one of a set of distinct categories, but quantifying or ranking the order of the categories is not possible.
Examples : Eye colour, occupation or ZIP code — each value is simply the name of a category.
− A nominal attribute which has either of the two states 0 or 1 is called a Binary attribute, where 0 means that the attribute is absent and 1 means that it is present.
− Symmetric binary variable : Both of its states, i.e. 0 and 1, are equally valuable; here we cannot decide which outcome should be coded 0 and which should be coded 1.
− Example : Marital status of a person is "Married or Unmarried". Both states are equally valuable and it is difficult to represent them in terms of 0 (absent) and 1 (present).
− Asymmetric binary variable : The outcomes of the states are not equally important. An example of such a variable is the presence or absence of a relatively rare attribute, e.g. whether a person is "handicapped or not handicapped". The most important outcome is usually coded as 1 (present) and the other as 0 (absent).
− A discrete ordinal attribute is a nominal attribute whose different states have a meaningful order or rank.
− The interval between different states is uneven, due to which arithmetic operations are not possible; however, the values can still be compared and ordered.
Examples
Considering age as an ordinal attribute, it can have three different states based on an uneven range of age values. Similarly, income can also be considered an ordinal attribute, categorised as low, medium or high based on the income value.
Age : 1. Teenage
2. Young
3. Old
Income : 1. Low
2. Medium
3. High
Numeric attributes are quantifiable : they can be measured in terms of a quantity, with either integer or real values. They can be of two types :
1. Interval scaled attributes
− Interval scaled attributes are measured on a linear scale, but the scale has no true zero point.
− Operations like addition and subtraction can be performed, but multiplication and division are not meaningful.
− Example : If a liquid is at 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0 degrees does not represent "no temperature".
2. Ratio scaled attributes
− Ratio scaled attributes are continuous positive measurements on a non-linear scale. They are also interval scaled data, but are not measured on a linear scale.
− Such variables may be handled in the following ways, to avoid distorting the results :
o Treating them as continuous ordinal scale data.
o Transforming the data (for example, logarithmic transformation) and then treating the results as interval scaled variables.
− If an attribute can take any value between two specified values, it is called continuous; otherwise it is discrete. An attribute may be continuous on one scale and discrete on another.
− Example : If we measure the amount of water consumed by counting individual water molecules, the attribute is discrete; otherwise it is continuous.
− Examples of continuous attributes include time spent waiting, direction of travel, water consumed, etc.
− Examples of discrete attributes include the voltage output of a digital device and a person's age in years.
− The process that transforms data into information through classifying, sorting, merging, recording, retrieving, transmitting or reporting is called data processing. Data processing can be manual or computer based.
− In the business world, data processing refers to processing data so as to enable effective functioning of organisations and businesses.
− Computer data processing refers to a process that takes data as input to a program and summarizes, analyses or converts it into useful information.
− The processing of data may also be automated.
− Data processing systems are also known as information systems.
− When data processing does not involve any data manipulation and only converts the data type, it may be called data conversion.
Q. What are the major tasks in data preprocessing ? Explain them in brief. (Oct. 16, Dec. 16, 6 Marks)
Data cleaning is also known as scrubbing. The data cleaning process detects and removes errors and inconsistencies and improves the quality of the data. Data quality problems arise due to misspellings during data entry, missing values and other invalid data.
− Source system data is not clean; it contains certain errors and inconsistencies.
− Specialised tools are available which can be used for cleaning the data.
− Some of the leading data cleansing vendors include Validity (Integrity), Harte-Hanks (Trillium) and Firstlogic.
Fig. 1.6.2 : Steps in Data Cleansing
1. Parsing
− Parsing is a process in which individual data elements are located and identified in the source systems and then
these elements are isolated in the target files.
− Example : Parsing a name into First name, Middle name and Last name, or parsing an address into street name, city, state and country.
2. Correcting
− This is the next phase after parsing, in which individual data elements are corrected using data algorithms and secondary data sources.
− Example : In the address attribute, replacing a vanity address and adding a zip code.
3. Standardizing
− In standardizing process conversion routines are used to transform data into a consistent format using both
standard and custom business rules.
− Example : addition of a prename, replacing a nickname and using a preferred street name.
4. Matching
− Matching process involves eliminating duplications by searching and matching records with parsed, corrected and
standardised data using some standard business rules.
− For example, identification of similar names and addresses.
5. Consolidating
Consolidation involves merging the records into one representation by analysing and identifying relationship between
matched records.
− Data can have many errors like missing data, or incorrect data at one source.
− When more than one source is involved there is a possibility of inconsistency and conflicting data.
7. Data staging
− Data staging is an interim step between data extraction and remaining steps.
− Using different processes like native interfaces, flat files, FTP sessions, data is accumulated from asynchronous
sources.
− After a certain predefined interval, data is loaded into the warehouse after the transformation process.
− No end user access is available to the staging file.
− For data staging, operational data store may be used.
Q. Describe the various methods for handling the missing values. (May 16, 6 Marks)
Q. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem. (Dec. 16, 6 Marks)
Q. What are missing values? Explain methods to handle missing values. (Dec. 17, 6 Marks)
Missing data values
− This involves searching for empty fields where values should occur.
− Data preprocessing is one of the most important stages in data mining. Real-world data is incomplete, noisy or inconsistent; this data is corrected in the data preprocessing process by filling in the missing values, smoothing out the noise and correcting the inconsistencies.
1. Ignore the data row
− In classification, suppose a class label is missing for a row : such a data row could be ignored. Similarly, if many attributes within a row are missing, the data row could be ignored. If the percentage of such rows is high, this will result in poor performance.
− Example : Suppose we have to build a model for predicting student success in college. For this purpose we have a student database with information about age, score, address, etc., and a column classifying success in college as "LOW", "MEDIUM" or "HIGH". Data rows in which the success column is missing are of no use to the model, and therefore can be ignored.
2. Fill in the missing value manually
− This is not feasible for large data sets and is also time consuming.
3. Use a global constant to fill in the missing value
− When missing values are difficult to predict, a global constant value like "unknown", "N/A" or "minus infinity" can be used to fill all the missing values.
− Example : Consider the students database; if the address attribute is missing for some students, it does not make sense to fill in these values, rather a global constant can be used.
4. Use attribute mean
− For missing values, mean or median of its discrete values may be used as a replacement.
− Example : In a database of family incomes, missing values may be replaced with the average income.
5. Use attribute mean for all samples belonging to the same class
− Instead of replacing missing values by the mean or median of all the rows in the database, we can consider class-wise data and replace missing values by the class mean or median to make the replacement more relevant.
− Example : Consider a car pricing database with classes like "luxury" and "low budget". If missing values need to be filled in, replacing the missing cost of a luxury car with the average cost of all luxury cars makes the data more accurate.
6. Use the most probable value
− Missing values may also be filled in by using techniques like regression, inference-based tools using Bayesian formalism, decision trees or clustering algorithms.
− For example, a clustering method may be used to form clusters and then the mean or median of that cluster may be used for the missing value. A decision tree may be used to predict the most probable value based on the other attributes.
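As a rough illustration of methods 4 and 5, the following pandas sketch fills one column with the overall attribute mean and another with the class-wise mean. The table, column names and values are invented purely for illustration; they are not from the text.

# Sketch : filling missing values with the attribute mean and the class-wise mean.
import pandas as pd

df = pd.DataFrame({
    "class":  ["luxury", "luxury", "budget", "budget", "luxury"],
    "price":  [50000, None, 12000, 11000, 52000],
    "income": [80000, 75000, None, 30000, None],
})

# Method 4 : replace missing income values with the overall mean income.
df["income"] = df["income"].fillna(df["income"].mean())

# Method 5 : replace a missing price with the mean price of the same class.
df["price"] = df.groupby("class")["price"].transform(lambda s: s.fillna(s.mean()))

print(df)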
1. Binning
Fig. 1.6.4 : Different approaches of binning
− Equal-depth (frequency) partitioning : The entire range is divided into N intervals, each containing approximately the same number of samples. This results in good data scaling; a disadvantage is that handling categorical attributes may be a problem.
− Example : Let us consider sorted data for e.g. Price in INR
Bin 1 : 4, 8, 9, 15
Bin 2 : 21, 21, 24, 25
Bin 3 : 26, 28, 29, 34
Smoothing by bin means
Replace each value of bin with its mean value.
Bin 1 : 9, 9, 9, 9
Bin 2 : 23, 23, 23, 23
Bin 3 : 29, 29, 29, 29
− Smoothing by bin boundaries
In this method the minimum and maximum values of each bin are taken as the bin boundaries, and each value is replaced with its nearest boundary value (either the minimum or the maximum).
Bin 1 : 4, 4, 4, 15
Bin 2 : 21, 21, 25, 25
Bin 3 : 26, 26, 26, 34
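The same equal-depth binning and smoothing can be sketched in a few lines of Python over the price list above (an illustrative re-implementation, not the book's own code):

# Sketch : equal-depth binning with smoothing by bin means / bin boundaries.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(data) // n_bins
bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means : every value becomes the (rounded) mean of its bin.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries : every value snaps to the nearer of min / max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]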
Ex. 1.6.1 : For the given attribute AGE values : 16, 16, 180, 4, 12, 24, 26, 28, apply the following binning technique for smoothing the noise :
i) Bin Medians
Soln. :
Sort the age values in ascending order : 4, 12, 16, 16, 24, 26, 28, 180
Partition into (equal-depth) bins : (N = 2)
(2) Median
Sort the elements in ascending order :
35 45 50 55 60 65 75
The middle element is 55, ∴ the median is 55.
(3) Mode
The mode is the most frequent value in the data set. As each number appears once, the frequency of all numbers is the same, so the data has no distinct mode.
Five-number summary
− Median → 55
− 1st Quartile → middle value of the lower half
− 3rd Quartile → middle value of the upper half
− Minimum → 35
− Maximum → 75
35 45 50 55 60 65 75
∴ First Quartile = Q1 = 45
Third Quartile = Q3 = 65
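For reference, the same five-number summary can be computed with Python's statistics module; the sketch below uses the simple median-of-halves rule for the quartiles, matching the reasoning above.

# Sketch : five-number summary using the median-of-halves rule for quartiles.
from statistics import median

data = sorted([35, 45, 50, 55, 60, 65, 75])
n = len(data)
lower, upper = data[:n // 2], data[(n + 1) // 2:]

summary = {
    "min": data[0],          # 35
    "Q1": median(lower),     # 45
    "median": median(data),  # 55
    "Q3": median(upper),     # 65
    "max": data[-1],         # 75
}
print(summary)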
2. Clustering
− Perform clustering on the attribute values and replace all values in a cluster by a cluster representative.
3. Regression
− Regression is a statistical measure used to determine the strength of the relationship between one dependent variable, denoted by Y, and a series of independent (changing) variables.
− Smoothing is done by fitting the data to regression functions.
− The two basic types are linear regression and multiple regression : the former uses one independent variable to predict the outcome, while the latter uses two or more independent variables to predict the outcome.
− The general form of each type of regression is :
Linear Regression : Y = a + bX + u
Multiple Regression : Y = a + b1X1 + b2X2 + … + bnXn + u
where Y = the dependent variable (the variable being predicted),
X = the independent variable(s),
a = the intercept,
b = the slope,
u = the regression residual.
− In multiple regression each independent variable is distinguished with subscripted numbers.
− Regression uses a group of random variables for prediction and finds a mathematical relationship between them. This relationship is depicted in the form of a straight line (linear regression) that approximates all the points in the best way.
− Regression may be used to determine, for example, the price of a commodity, interest rates, or the price movement of an asset influenced by industries or sectors.
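A minimal least-squares sketch of the linear form Y = a + bX follows; the data points are invented for illustration and are not an example from the book.

# Sketch : fitting Y = a + bX by ordinary least squares on toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = cov(X, Y) / var(X); intercept a = mean(Y) - b * mean(X).
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"Y = {a:.2f} + {b:.2f}X")   # approximately Y = 0.09 + 1.99X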
1.6.2 Introduction to Data Integration
− Data integration assumes a state in which the quality of the existing data is understood and the desired quality of the data is known.
A coherent data store (e.g. a Data warehouse) is prepared by collecting data from multiple sources like multiple
databases, data cubes or flat files.
− Schema integration
o Integrate metadata from different sources.
o Entity identification problem: identify real world entities from multiple data sources, e.g. A.cust-id ≡B.cust-#.
− Schema integration is an issue as to integrate metadata from different sources is a difficult task.
− Identify real world entities from multiple data sources and their matching is the entity identification problem.
− For example, Roll number in one database and enrollment number in another database refers to the same attribute.
− Such conflicts may create problem for schema integration.
− Detecting and resolving data value conflicts : for the same real-world entity, attribute values from different sources may differ.
1.6.2(B) Redundancy and Correlation Analysis
− Data redundancy occurs when data from multiple sources is considered for integration.
− Attribute naming may be a problem, as the same attribute may have different names in multiple databases.
− An attribute may be a derived attribute in another table, e.g. "yearly income".
− Redundancy can be detected using correlation analysis.
− To reduce or avoid redundancies and inconsistencies, data integration must be carried out carefully. This will also improve the speed and quality of the mining algorithm.
− The χ² (Chi-square) test can be carried out on nominal data to test how strongly two attributes are related.
− The correlation coefficient and covariance may be used with numeric data; these give the variation between the attributes.
The χ² (Chi-square) test
− It is used to test hypotheses about the shape or proportions of a population distribution by means of sample data.
− For nominal data, a correlation relationship between two attributes, P and Q, can be discovered by a χ² (Chi-square) test.
− These nominal variables, also called "attribute variables" or "categorical variables", classify observations into a small number of categories, which are not numbers. The test does not work for numeric data.
− Examples of nominal variables include Gender (the possible values are male or female), Marital Status (married, unmarried or divorced), etc.
− The Chi-square test is used to test the probability of independence of a distribution of data, but it does not give any details about the relationship between the attributes.
χ² = Σ [ (O – E)² / E ]
where χ² = Chi-square,
E = expected frequency, which is the number of subjects you would expect to find in each category based on known information,
O = observed frequency, which is the number of subjects you actually found to be in each category in the present data.
DF = (r – 1) * (c – 1)
where r is the number of levels for one categorical variable and c is the number of levels for the other categorical variable.
− Expected frequencies : the count computed for each combination of levels of the two categorical attributes. The formula for the expected frequency at level r of attribute X and level c of attribute Y is
E(r, c) = (nr * nc) / n
where
o nr is the sum of sample observations at level r of attribute X,
o nc is the sum of sample observations at level c of attribute Y,
o n is the total size of the sample data.
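The following Python sketch works through the χ² computation on a small made-up 2 x 2 contingency table; the observed counts are invented purely for illustration.

# Sketch : chi-square statistic for a 2 x 2 contingency table (toy counts).
observed = [
    [250, 200],    # level 1 of attribute X : counts for levels 1 and 2 of Y
    [50, 1000],    # level 2 of attribute X
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for r, row in enumerate(observed):
    for c, o in enumerate(row):
        e = row_totals[r] * col_totals[c] / n     # expected frequency E(r, c)
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # DF = (r - 1) * (c - 1)
print(round(chi_sq, 2), df)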
− Operational databases keep changing with the requirements; a data warehouse integrating data from these multiple sources must handle such changes consistently.
− The most common problem is "Attribute Naming Inconsistency" : it is very common for different names to be used for the same attribute in different sources of data.
− E.g. Manager Name may be MGM_NAME in one database, MNAME in the other.
− In this case, one set of data names is chosen and used consistently in the data warehouse.
− Once naming consistency is achieved, the data must be converted to a common format, for example :
(ii) To ensure consistency, uppercase representation may be used for mixed-case text.
(vi) A common format must be used for coded data (e.g. Male/Female, M/F).
− The above conversions can be automated, and many tools are available for the transformation, e.g. DataMapper.
− Generalization : In generalization, data is replaced by higher-level concepts using a concept hierarchy.
− Normalization : In normalization, attribute values are scaled to fall within a specified range.
Example : To transform V in [Min, Max] to V′ in [0, 1], apply
V′ = (V – Min) / (Max – Min)
− Scaling by using the mean and standard deviation (useful when Min and Max are unknown or when there are outliers) :
V′ = (V – Mean) / StdDev
1.6.3(B) Data Discretization
− The range of a continuous attribute is divided into intervals.
− Some classification algorithms accept only categorical attributes, which is one motivation for discretizing continuous attributes.
− Discretization reduces the size of the data and prepares it for further analysis.
− Dividing the range of an attribute into intervals reduces the number of values for a given continuous attribute.
1.6.3(C) Data Transformation by Normalization
Q. What are the different data normalization methods? Explain them in brief. (May 17, 6 Marks)
− Data Transformation by Normalization or standardization is the process of making an entire set of values have a
particular property.
1. Min-Max normalization
− Min-max normalization results in a linear alteration of the original data, so that the values fall within a given range.
− The following formula may be used to map a value v of an attribute A from the range [minA, maxA] to a new range [new_minA, new_maxA] :
v′ = ((v – minA) / (maxA – minA)) * (new_maxA – new_minA) + new_minA
− Example : v = 73,600 in [12,000, 98,000] maps to v′ = 0.716 in the new range [0, 1].
Ex. 1.6.1 : Consider the following group of data : 200, 300, 400, 600, 1000
(i) Use min-max normalization to transform the value 600 onto the range [0.0, 1.0].
(ii) Use decimal scaling to transform the value 600. (SPPU - Oct. 16, 4 Marks)
Soln. :
(i) Min = minimum value of the given data = 200, Max = 1000
v′ = ((600 – 200) / (1000 – 200)) * (1.0 – 0.0) + 0.0 = 400 / 800 = 0.5
(ii) Decimal scaling for 600 :
10^k = 10^3 = 1000
600 / 1000 = 0.6
Ex. 1.6.2 : Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000 respectively. Normalize the income value $73,600 to the range [0.0, 1.0] using the min-max normalization method.
Soln. : v′ = (73,600 – 12,000) / (98,000 – 12,000) = 61,600 / 86,000 = 0.716
2. Z-score
In Z-score normalization, data is normalized based on the mean and standard deviation. Z-score is also known as zero-mean normalization.
v′ = (v – meanA) / std_devA
Example : For values v = 10, 20, 30 with Mean = 20 and std_dev = 10,
v′ = (–1, 0, 1)
3. Decimal scaling
Based on the maximum absolute value of the attribute, the decimal point is moved. This process is called decimal scale normalization.
Example : For the range between –991 and 99,
10^k is 1000 (k = 3, as we have a maximum 3-digit number in the range)
v′(–991) = –0.991 and v′(99) = 0.099
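A compact Python sketch of the three normalization methods above, applied to the data from Ex. 1.6.1 (written from the formulas in the text; it is not the book's own program):

# Sketch : min-max, z-score and decimal-scaling normalization.
import math

data = [200, 300, 400, 600, 1000]

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scale(v, k):
    return v / (10 ** k)

mean = sum(data) / len(data)
std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

print(min_max(600, min(data), max(data)))   # 0.5
print(round(z_score(600, mean, std), 3))    # about 0.354
print(decimal_scale(600, 3))                # 0.6, with k = 3 as in the example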
In discretization by histogram, the data is divided into buckets and the average (or sum) for each bucket is stored as a smaller data representation.
1. Equal-width histograms
The range is divided into N intervals of equal size (uniform width).
2. Equal-depth (frequency) histograms
The range is divided into N intervals, each containing approximately the same number of samples.
3. V-optimal
Different histogram types for a given number of buckets are considered, and the one with the least variance is chosen.
4. MaxDiff
After sorting the data, the borders of the buckets are defined where adjacent values have the maximum difference.
Example
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,
18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30
Fig. : Histogram of the above data sample
− Binning (histograms) : Representing attribute values by groups called bins results in a smaller number of distinct values.
− Clustering : Grouping the data based on similarity into groups called clusters.
− Aggregation or generalization.
Fig. 1.6.11 : Data Reduction Techniques
− Data cube aggregation reduces the data to the concept level needed in the analysis, using the smallest (most detailed) level necessary to solve the problem.
− Queries regarding aggregated information should be answered using the data cube when possible.
Q. Enlist the dimensionality reduction techniques for text. Explain any one of them in brief. (May 16, Dec. 17, 6 Marks)
Q. Explain different methods for attribute subset selection (any 2). (Oct. 18, 4 Marks)
− During analysis, data sets may contain a large number of attributes that are irrelevant or redundant to the mining task.
− Dimensionality reduction is a process in which such attributes are removed, so that the resulting dataset is smaller in size.
− This process helps in reducing the time and space complexity required by a data mining technique.
− Attribute subset selection refers to a process in which a minimum set of attributes is selected in such a way that its distribution represents the same as the original data set distribution considering all the attributes.
1. Forward selection
− Start with an empty set; determine the best of the original attributes and add it to the set.
− At each step, find the best of the remaining original attributes and add it to the set.
2. Backward elimination
− Start with the full set of attributes and, at each step, remove the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination
− The procedure combines both : it selects the best attribute and removes the worst among the remaining attributes.
− For all the above methods the stopping criteria differ; a threshold on the measure used is required to stop the attribute selection process, as sketched in the code below.
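A rough sketch of greedy forward selection with a threshold-based stopping rule is given below. The scoring function and threshold are placeholders supplied by the caller; they are assumptions for illustration, not something defined in the text.

# Sketch : greedy forward attribute selection with a threshold stopping rule.
def forward_selection(attributes, score_fn, min_gain=0.01):
    # score_fn(subset) should return a quality measure for that attribute subset.
    selected = []
    best_score = score_fn(selected)
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute and keep the one with the best gain.
        gain, best_attr = max((score_fn(selected + [a]) - best_score, a)
                              for a in remaining)
        if gain < min_gain:        # stop once the improvement falls below threshold
            break
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score += gain
    return selected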
− Data compression is the process of reducing the number of bits needed to either store or transmit data. The data can be text, graphics, video, audio, etc. This can usually be done with the help of encoding techniques.
− Data compression techniques can be classified as either lossy or lossless. In a lossy technique there is a loss of information, whereas in a lossless technique there is no loss.
Lossless compression
− Lossless compression consists of those techniques guaranteed to generate an exact duplication of the input dataset
after a compress/decompress cycle.
− Lossless compression is essentially a coding technique. There are many different kinds of coding algorithms, such as
Huffman coding, run-length coding and arithmetic coding.
Lossy compression
− In lossy compression techniques, a higher compression ratio can be achieved at the cost of data quality.
− These types of techniques are useful in applications where data loss is affordable. They are mostly applied to digitized representations of analog phenomena.
− Two methods of lossy data compression used for data reduction are wavelet transforms and Principal Component Analysis (PCA); the fast (pyramid) wavelet transform algorithm has complexity O(N) in the length N of the input data vector.
− Principal Component Analysis (PCA) creates a representation of the data with orthogonal basis vectors, i.e. the eigenvectors of the covariance matrix of the data. These can also be derived using the Singular Value Decomposition (SVD) method. By this projection the original dataset is reduced with little loss of information.
− PCA is often presented using the eigenvalue/eigenvector approach on the covariance matrix, but for efficient computation it is the Singular Value Decomposition (SVD) of the data matrix that is used.
− A few scores of the PCA and the corresponding loading vectors can be used to estimate the contents of a large data matrix.
− The idea behind this is that by reducing the number of eigenvectors used to reconstruct the original data matrix, the amount of required storage space is reduced.
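As a small illustration of the SVD route to PCA, the following NumPy sketch projects mean-centred toy data onto its first principal component and reconstructs a low-rank approximation (the numbers are invented for illustration):

# Sketch : PCA via SVD - keep the top-k principal components of centred data.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                        # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                          # number of components kept
scores = Xc @ Vt[:k].T                         # projection onto top-k components
X_approx = scores @ Vt[:k] + X.mean(axis=0)    # reduced-storage reconstruction

print(scores.ravel())
print(X_approx.round(2))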
− Numerosity reduction refers to reducing the volume of data by choosing smaller forms of data representation.
− Different techniques used for numerosity reduction are :
1. Histograms
− Divide data into buckets and store average (sum) for each bucket.
− A bucket represents an attribute-value/frequency pair.
2. Clustering
o If an example may belong to many clusters, the clustering is said to be overlapping.
o If an example belongs to a cluster with a certain probability, the clustering is said to be probabilistic.
o A hierarchical representation may be used, in which clusters at the highest level of the hierarchy are subsequently refined at lower levels to form sub-clusters.
3. Sampling
Types of sampling
− Simple random sampling with replacement : The objects selected for the sample are not removed from the population, so the same object may be selected multiple times.
− Stratified sampling : The data is split into partitions (strata) and samples are drawn randomly from each partition, as illustrated in the sketch below.
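A brief Python sketch of sampling with replacement and stratified sampling using the standard random module; the records and the stratification key are invented for illustration.

# Sketch : sampling with replacement and stratified sampling over toy records.
import random
from collections import defaultdict

records = [{"id": i, "grade": random.choice("ABC")} for i in range(100)]

# Simple random sample WITH replacement : the same record may appear twice.
srswr = random.choices(records, k=10)

# Stratified sample : partition by 'grade', then draw randomly from each stratum.
strata = defaultdict(list)
for r in records:
    strata[r["grade"]].append(r)

stratified = [r for group in strata.values()
              for r in random.sample(group, k=min(3, len(group)))]

print(len(srswr), len(stratified))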
Q. 1 Discuss whether or not each of the following activities is a data mining task. (May 16, 6 Marks)
(i) Computing the total sales of a company.
(ii) Predicting the future stock price of a company using historical records.
(iii) Predicting the outcomes of tossing a pair of dice.
Ans. :
(i) Computing the total sales of a company :
This activity is not a data mining task because the total sales can be computed using simple calculations.
(ii) Predicting the future stock price of a company using historical records
This activity is a data mining task. Historical records of stock prices can be used to create a predictive model called regression, one of the predictive modelling tasks used for continuous variables.
(iii) Predicting the outcomes of tossing a pair of dice :
This activity is not a data mining task because predicting the outcome of tossing a fair pair of dice is a probability calculation, which does not require dealing with large amounts of data or using complicated calculations or techniques.
Q. 2 Differentiate between Descriptive and Predictive data mining tasks. (Oct. 16, 2 Marks)
Ans. :
(a) Descriptive mining : Derives patterns like correlations, trends, etc. that summarize the underlying relationships in the data. Typical tasks include :
o Class/Concept description
o Mining of associations
o Mining of correlations
o Mining of clusters
(b) Predictive mining : Predicts the value of a specific attribute based on the values of other attributes. Typical models include :
o Mathematical formulae
o Neural networks
Review Questions
2 Data Warehouse
Unit II
Syllabus
Data Warehouse, Operational Database Systems and Data Warehouses (OLTP Vs OLAP), A Multidimensional Data Model : Data Cubes, Stars, Snowflakes, and Fact Constellations Schemas; OLAP Operations in the Multidimensional Data Model, Concept Hierarchies, Data Warehouse Architecture, The Process of Data Warehouse Design, A Three-tier Data Warehousing Architecture, Types of OLAP Servers : ROLAP versus MOLAP versus HOLAP.
2.1 Data Warehouse
Precisely, a data warehouse system proves helpful in providing collective information to all its users. It is mainly created to support analyses and queries that need extensive searching on a large scale. With the help of data warehousing technology, every industry, from retail and financial institutions to manufacturing enterprises, government departments and airline companies, is changing the way it performs business analysis and strategic decision making.
The term Data Warehouse was defined by Bill Inmon in 1990 in the following way : "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
1. Subject Oriented
Data that gives information about a particular subject instead of about a company's ongoing operations.
2. Integrated
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
3. Time-variant
All data in the data warehouse is identified with a particular time period.
4. Non-volatile
Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a
consistent picture of the business.
Ralph Kimball provided a much simpler definition : "a data warehouse is a copy of transaction data specifically structured for query and analysis". This is a functional view of a data warehouse; Kimball did not address how the data warehouse is built, as Inmon did, but rather focused on its functionality.
5. High quality data
Data in a data warehouse is cleaned and transformed into the desired format, so data quality is high.
2.1.2 Features of Data Warehouse
Characteristics/ Features of a Data Warehouse
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse :
3. Non-volatile
Non-volatile means that, once data has entered the warehouse, it cannot be removed or changed, because the purpose of a warehouse is to analyze the data.
4. Time Variant
A data warehouse maintains historical data. For example : a customer record has details of his job; a data warehouse would maintain all his previous jobs (historical information), whereas a transactional system maintains only the current job, due to which it is not possible to retrieve older records.
2.2.1 Why are Operational Systems not Suitable for Providing Strategic Information ?
The fundamental reason for the inability to provide strategic information is that strategic information has been sought from systems that were designed only to run day-to-day operations.
These operational systems, such as a university record system, inventory management, claims processing, outpatient billing, and so on, are not designed to provide strategic information.
If we need strategic information, it must be collected from altogether different types of systems. Only specially designed decision support systems or informational systems can provide strategic information.
Operational systems are tuned for known transactions and workloads, while workload is not known a priori in a data
warehouse.
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., the average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December.
Sr. No. | Operational Database System | Data Warehouse (or DSS - Decision Support System)
1. | Application oriented | Subject oriented
2. | Used to run business | Used to analyze business
3. | Detailed data | Summarized and refined
4. | Current, up to date | Snapshot data
5. | Isolated data | Integrated data
6. | Repetitive access | Ad-hoc access
7. | Clerical user | Knowledge user (manager)
8. | Performance sensitive | Performance relaxed
9. | Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
10. | Read/update access | Mostly read (batch update)
11. | No data redundancy | Redundancy present
12. | Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
Q. Differentiate between OLTP and OLAP with example. (Dec. 18, 6 Marks)
OLAP (On-Line Analytical Processing) supports the multidimensional view of data.
OLAP provides fast, steady and proficient access to the various views of information.
Complex queries can be processed; it is easy to analyze information by processing complex queries on multidimensional views of data.
A data warehouse is generally used to analyze information where a huge amount of historical data is stored. Information in a data warehouse is related to more than one dimension, like sales, market trends, buying patterns, supplier, etc.
Definition
Definition given by the OLAP Council (www.olapcouncil.org) : On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Application Differences
Sr. No. | OLTP (On-Line Transaction Processing) | OLAP (On-Line Analytical Processing)
1. | Transaction oriented | Subject oriented
2. | High Create/Read/Update/Delete (CRUD) activity | High Read activity
3. | Many users | Few users
4. | Continuous updates - many sources | Batch updates - single source
5. | Real-time information | Historical information
6. | Tactical decision-making | Strategic planning
7. | Controlled, customized delivery | "Uncontrolled", generalized delivery
8. | RDBMS | RDBMS and/or MDBMS
9. | Operational database | Informational database
Model Differences
Sr. No. | OLTP | OLAP
1. | Single purpose model - supports Operational System | Multiple models - support Informational Systems
6. | Technical metadata depends on business requirements | Technical metadata depends on data mapping results
7. | This moment in time is important | Many moments in time are essential elements
Dimensional model is the underlying data model used by many of the commercial OLAP products available today in
the market.
Dimensional model uses the relational model with some important restrictions.
It is one of the most feasible techniques for delivering data to the end users in a data warehouse.
Every dimensional model is composed of at least one table with a multipart key called the fact table and a set of other
related tables called dimension tables.
Multidimensional models store data in multi-dimensional matrices such as Data Cubes or Hypercubes.
A standard spreadsheet, representing a conventional database, is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products sold with respect to region can be shown in a 2-dimensional matrix, but when one more dimension such as time is added, it produces a 3-dimensional matrix, as shown in Fig. 2.3.1.
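The following short numpy sketch illustrates such a 3-dimensional sales cube held as a matrix; the dimension names and numbers are invented purely for illustration.

import numpy as np

# A small illustrative sales cube: 2 regions x 3 products x 4 quarters
regions, products, quarters = ["East", "West"], ["TV", "Phone", "PC"], ["Q1", "Q2", "Q3", "Q4"]
cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # made-up sales figures

# Total sales per region (collapse the product and time axes)
print(cube.sum(axis=(1, 2)))
# Sales of "Phone" in "West" during "Q3"
print(cube[regions.index("West"), products.index("Phone"), quarters.index("Q3")])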
Star Schema is the most popular schema design for a data warehouse. Dimensions are stored in a Dimension table and
every entry has its own unique identifier.
Every Dimension table is related to one or more fact tables. All the unique identifiers (primary keys) from the
dimension tables make up for a composite key in the fact table.
The fact table also contains facts. For example, a combination of store_id, date_key and product_id giving the amount
of a certain product sold on a given day at a given store.
Foreign keys for the dimension tables are contained in the fact table. For example, date_key, product_id and store_id are all foreign keys.
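A minimal pandas sketch of a star-schema query over such a fact table, assuming illustrative dimension and fact tables; the table and column names are ours and only mirror the example above.

import pandas as pd

# Illustrative dimension tables
store   = pd.DataFrame({"store_id": [1, 2], "city": ["Pune", "Mumbai"]})
product = pd.DataFrame({"product_id": [10, 11], "product_name": ["Soap", "Tea"]})
date    = pd.DataFrame({"date_key": [20231201], "month": ["Dec"]})

# Fact table: foreign keys to the dimensions plus the measured fact
sales_fact = pd.DataFrame({
    "store_id": [1, 2], "product_id": [10, 11],
    "date_key": [20231201, 20231201], "units_sold": [5, 3],
})

# A star-schema query is a join of the fact table with the needed dimensions
report = (sales_fact.merge(store, on="store_id")
                    .merge(product, on="product_id")
                    .merge(date, on="date_key"))
print(report[["city", "product_name", "month", "units_sold"]])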
In dimensional modeling, fact tables are normalized, whereas dimension tables are not. The size of the fact tables is large as compared to the dimension tables. The facts in the star schema can be classified into three types.
(i) Additive
Additive facts are facts that can be summed up through all of the dimensions in the fact table.
(ii) Semi-additive
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Example : Bank balances. A current account balance is semi-additive : it cannot be summed across the time dimension, but to get the bank's total current balance at a point in time you can sum the current balances of all accounts.
(iii) Non-additive
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Example : Ratios, Averages and Variance
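The difference between additive and semi-additive facts can be illustrated with a small pandas sketch; the snapshot data below is hypothetical.

import pandas as pd

# Daily snapshot of account balances (illustrative data)
snap = pd.DataFrame({
    "day":     ["Mon", "Mon", "Tue", "Tue"],
    "account": ["A",   "B",   "A",   "B"],
    "balance": [100,   200,   120,   180],   # semi-additive fact
    "deposit": [100,   200,    20,     0],   # additive fact
})

# Additive: deposits can be summed across every dimension, including time
print(snap["deposit"].sum())                          # 320
# Semi-additive: balances may be summed across accounts for one day ...
print(snap[snap["day"] == "Tue"]["balance"].sum())    # 300
# ... but summing balances across days (100+200+120+180) would be meaningless.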
It is a hybrid structure (i.e. star schema + snowflake schema).
Every fact points to one tuple in each of the dimensions and has additional attributes.
Does not capture hierarchies directly.
Straightforward means of capturing a multiple dimension data model using relations.
Q. Differentiate between Star schema and Snowflake schema. (Dec. 18, 6 Marks)
1. A star schema contains the dimension tables mapped around one or more fact tables. | A snowflake schema contains in-depth joins because the tables are split into many pieces.
Factless Fact Table
A factless fact table contains only keys; there are no measures available in the fact table.
Used only to put relation between the elements of various dimensions.
Are useful to describe events and coverage, i.e. the tables contain information that something has/has not happened.
Often used to represent many-to-many relationships.
The only thing they contain is a concatenated key; however, they still represent a focal event, which is identified by the combination of conditions referenced in the dimension tables.
An Example of Factless fact table can be seen in the Fig. 2.3.4.
Fig. 2.3.5 : Types of Factless Fact Table
1. Event Tracking Tables
Use a factless fact table to track events of interest to the organization. For example, attendance at a cultural event can be tracked by creating a fact table containing the following foreign keys (i.e. links to dimension tables) : event identifier, speaker/entertainer identifier, participant identifier, event type, date. This table can then be queried to find out information such as which cultural events or event types are the most popular.
The following example shows a factless fact table which records every time a student attends a course. It can answer queries such as : Which class has the maximum attendance? What is the average attendance for a given course?
All such queries are based on COUNT() with GROUP BY; we can first count and then apply other aggregate functions such as AVG, MAX and MIN.
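A minimal sketch of such a factless fact table query, assuming an illustrative attendance table; all names are hypothetical.

import pandas as pd

# Factless fact table: one row per (student, course, date) attendance event,
# holding only foreign keys and no numeric measure.
attendance = pd.DataFrame({
    "student_id": [1, 2, 3, 1, 2],
    "course_id":  ["DMW", "DMW", "DMW", "DBMS", "DBMS"],
    "date_key":   [20231201, 20231201, 20231201, 20231202, 20231202],
})

# Which course has the maximum attendance? Count rows per course.
counts = attendance.groupby("course_id").size()
print(counts.idxmax(), counts.max())   # DMW 3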
2. Coverage Tables
The other type of factless fact table is called a coverage table by Ralph Kimball. It is used to support negative analysis reports, for example, a store that did not sell a product for a given period. To produce such a report, you need a fact table that captures all the possible combinations; you can then figure out what is missing.
Fact Constellation
As its name implies, it is shaped like a constellation of stars (i.e., star schemas).
This schema is more complex than star or snowflake varieties, which is due to the fact that it contains multiple fact
tables.
This allows dimension tables to be shared amongst the fact tables.
A schema of this type should only be used for applications that need a high level of sophistication.
For each star schema or snowflake schema it is possible to construct a fact constellation schema.
That solution is very flexible, however it may be hard to manage and support.
The main disadvantage of the fact constellation schema is a more complicated design, because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the dimensions that are relevant for the given facts.
This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper
dimension level.
Use of this model is reasonable when, for example, there is a sales fact table (with details down to the exact date and invoice header id) and a sales forecast fact table calculated based on month, client id and product id.
In that case, using two different fact tables at different levels of grouping is realized through a fact constellation model.
A fact constellation schema is also called a family of stars.
Soln. :
(a) Star Schema
Ex. 2.3.2 : The Mumbai University wants you to help design a star schema to record grades for courses completed by students. There are four dimension tables, namely course_section, professor, student, period, with attributes as follows :
Course_section Attributes : Course_Id, Section_number, Course_name, Units, Room_id, Roomcapacity.
During a given semester the college offers an average of 500 course sections
Professor Attributes : Prof_id, Prof_Name, Title, Department_id, department_name
Student Attributes : Student_id, Student_name, Major. Each Course section has an average of 60 students
Period Attributes : Semester_id, Year. The database will contain data for a 30-month period. The only fact that is to be recorded in the fact table is the course grade.
Answer the following Questions
(a) Design the star schema for this problem
(b) Estimate the number of rows in the fact table, using the assumptions stated above and also estimate
the total size of the fact table (in bytes) assuming that each field has an average of 5 bytes.
(c) Can you convert this star schema to a snowflake schema ? Justify your answer and design a
snowflake schema if it is possible.
Soln. :
(a) Star Schema
Now, number of rows in the fact table = 500 course sections × 60 students × 5 semesters (30 months) = 30,000 × 5 = 150000 rows.
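A back-of-the-envelope check of the size estimate in part (b), assuming each fact table row holds 5 fields (4 dimension keys plus the grade); that field count is our assumption, not stated in the exercise.

# Rough size estimate for Ex. 2.3.2(b)
sections_per_semester = 500
students_per_section  = 60
semesters             = 5        # 30 months ~ 5 semesters
fields_per_row        = 5        # assumed: 4 keys + 1 grade
bytes_per_field       = 5

rows = sections_per_semester * students_per_section * semesters
size_bytes = rows * fields_per_row * bytes_per_field
print(rows, size_bytes)          # 150000 rows, 3750000 bytes (~3.75 MB)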
Yes, the above star schema can be converted to a snowflake schema, considering the following assumptions
Courses are conducted in different rooms, so course dimension can be further normalized to rooms dimension as
shown in the Fig. P. 2.3.2(a).
Professor belongs to a department, and department dimension is not added in the star schema, so professor
dimension can be further normalized to department dimension.
Similarly students can have different major subjects, so it can also be normalized as shown in the Fig. P. 2.3.2(a).
Fig. P. 2.3.2(a) : University Snowflake Schema
Ex. 2.3.3 : Give an information package for recording information requirements for "Hotel Occupancy", considering dimensions like Time, Hotel, etc. Design a star schema from the information package.
Soln. :
An information package diagram is an approach to determining the requirements of a data warehouse.
It gives the metrics which specifies the business units and business dimensions.
The information package diagram defines the relationship between the subjects or dimensions and the key performance measures (facts).
The information package diagram shows the details that users want, so it is effective for communication between the users and the technical staff.
Table P. 2.3.3 : Information Package for Hotel Occupancy
Facts
(d) No of occupants
(e) Revenue
Ex. 2.3.4 : For a supermarket chain, consider the following dimensions, namely Product, Store, Time, Promotion. The schema contains a central fact table sales_facts with three measures : unit_sales, dollar_sales and dollar_cost.
Design star schema and calculate the maximum number of base fact table records for the values given below :
Product : 40,000 products in each store(about 4000 sell in each store daily)
Promotion : a sold item may be in only one promotion in a store on a given day
Soln. :
(a) Star schema
Promotion = 1
Maximum number of base fact table records ≈ 2 billion
Ex. 2.3.5 : Draw a Star Schema for Student academic fact database.
Soln. :
Ex. 2.3.6 : List the dimensions and facts for the Clinical Information System and Design Star and Snow Flake Schema.
Soln. :
Dimensions
1. Patient 2. Doctor
3. Procedure 4. Diagnose
5. Date of Service 6. Location
7. Provider
Facts
1. Adjustment
2. Charge
3. Age
Ex. 2.3.8 : A manufacturing company has a huge sales network. To control the sales, it is divided in the regions. Each
region has multiple zones. Each zone has different cities. Each sales person is allocated different cities. The
objective is to track sales figures at different granularity levels of region, and also to count the number of products sold. Create a data warehouse schema that takes into consideration the above granularity levels for region and sales person, and the quarterly, yearly and monthly sales.
Soln. :
Ex. 2.3.9 : A bank wants to develop a data warehouse for effective decision-making about their loan schemes. The bank
provides loans to customers for various purposes like House Building loan, car loan, educational loan,
personal loan etc. The whole country is categorized into a number of regions, namely, North, South, East,
West. Each region consists of a set of states; loan is disbursed to customers at interest rates that change from
time to time. Also, at any given point of time, the different types of loans have different rates. The data warehouse should record an entry for each disbursement of a loan to a customer. With respect to the above business scenario :
(i) Design an information package diagram. Clearly explain all aspects of the diagram.
(ii) Draw a star schema for the data warehouse clearly identifying the fact tables, dimension tables, their
attributes and measures.
Soln. : (i)
Time : Time_key, Day, Year, Holiday_flag
Customer : Customer_key, Account_number
Branch : Branch_key, Branch_Area
Location : Location_key, Region
Soln. :
Fig. P. 2.3.12 : Star schema for autosales analysis of the company
Ex. 2.3.13 : Consider the following database for a chain of bookstores.
BOOKS (Booknum, Primary_Author, Topic, Total_Stock, Price)
BOOKSTORE (Storenum, City, State, Zip, inventory_Value)
(b) Design a star schema for the data warehouse clearly identifying the fact tables(s), Dimension table(s),
their attributes and measures.
Soln. :
b) Star Schema
Ex. 2.3.14 : One of India’s large retail departmental chains, with annual revenues touching $2.5 billion mark and having
over 3600 employees working at diverse locations, was keenly interested in a business intelligence solution
that can bring clear insights on operations and performance of departmental stores across the retail chain. The
company needed to support a data warehouse that holds daily sales data from Point of Sale (POS) systems across all locations, with 80 million rows and 71 columns.
(a) List the dimensions and facts for above application.
(b) Design star schema and snow flake schema for the above application.
Soln. :
a) Dimensions : Product, Store, Time, Location
Facts : Unit Sales, Dollar Sales, Dollar Cost
b) Star Schema and snowflake Schema
Example
Let us consider a company selling electronic products. The data cube of the company consists of 3 dimensions : Location (aggregated with respect to city), Time (aggregated with respect to quarter) and Item (aggregated with respect to item type).
1. Consolidation or Roll Up
Multi-dimensional databases generally have hierarchies with
respect to dimensions.
Consolidation is rolling up or adding data relationship with
respect to one or more dimensions.
For example, adding up all product sales to get total City data.
For example, Fig. 2.4.2 shows the result of roll up operation
performed on the central cube by climbing up the concept
hierarchy for location.
This hierarchy was defined as the total order street < city < province_or_state < country.
The roll-up operation shown aggregates the data from the city level up to the country level of the location hierarchy.
Fig. 2.4.1 : OLAP Operations in the Multidimensional Data Model
2. Drill-down
Drill Down is defined as changing the view of data to a greater level of detail.
For example, the Fig. 2.4.3 shows the result of drill down operations performed on the upper cube by stepping
down a concept hierarchy for time defined as day<month<quarter<year.
Slicing and dicing refers to the ability to look at a database from various viewpoints.
Slice operation carry out selection with respect to one dimension of the given cube and produces a sub cube.
For example, Fig. 2.4.4 shows the slice operation where the sales data are selected from the left cube for the
dimension time using the criterion time = “Q1”
Fig. 2.4.4 : Slice operation
4. Dice
Dice operation carry out selection with respect to two or more dimensions of the given cube and produces a sub
cube.
For example, the dice operation is performed on the left cube based on three dimensions, Location, Time and Item, as shown in Fig. 2.4.5, where the criteria are (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
5. Pivot / Rotate
Pivot technique is used for visualization of data. This operation rotates the data axis to give another presentation
of the data.
For example Fig. 2.4.6 shows the pivot operation where the item and location axis in a 2-D slice are rotated.
Fig. 2.4.6 : Pivot operation
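The OLAP operations above can be sketched on a tiny cube held as a pandas fact table; the data and column names below are illustrative only.

import pandas as pd

cube = pd.DataFrame({
    "location": ["Pune", "Pune", "Mumbai", "Mumbai"],
    "quarter":  ["Q1",   "Q2",   "Q1",     "Q2"],
    "item":     ["TV",   "TV",   "Phone",  "TV"],
    "sales":    [100,    120,    80,       90],
})

# Slice: fix one dimension (time = "Q1")
print(cube[cube["quarter"] == "Q1"])
# Dice: select on two or more dimensions
print(cube[cube["quarter"].isin(["Q1", "Q2"]) & (cube["item"] == "TV")])
# Roll-up: climb the location hierarchy by aggregating city-level sales
print(cube.groupby("location")["sales"].sum())
# Pivot: rotate the axes to view item vs. quarter
print(cube.pivot_table(index="item", columns="quarter", values="sales", aggfunc="sum"))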
Drill across
This technique is used when there is a need to execute a query involving more than one fact table.
Drill through
This technique uses relational SQL facilities to drill through the bottom level of the data cube.
2.5 Concept Hierarchies
The amount of data may be reduced using concept hierarchies. The low level detailed data (for example numerical
values for age) may be represented by higher-level data (e.g. Young, Middle aged or Senior).
The users or experts may specify a partial or total ordering of attributes explicitly at the schema level. Alternatively, hierarchies or attribute levels may be generated automatically by analysing the number of distinct values.
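A minimal sketch of generalizing detailed values along such a concept hierarchy, using pandas binning; the age cut-points and labels are assumptions.

import pandas as pd

# Numeric ages generalized to a higher-level concept
ages = pd.Series([23, 35, 47, 61, 19, 52])
hierarchy = pd.cut(ages, bins=[0, 30, 50, 120], labels=["Young", "Middle aged", "Senior"])
print(hierarchy.tolist())
# ['Young', 'Middle aged', 'Middle aged', 'Senior', 'Young', 'Senior']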
The data in a data warehouse comes from operational systems of the organization as well as from other external sources. These are collectively referred to as source systems.
The data extracted from source systems is stored in an area called the data staging area, where the data is cleaned, transformed, combined and de-duplicated to prepare it for loading into the data warehouse.
The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide any query or presentation services.
As soon as a system provides query or presentation services, it is categorized as a presentation server.
A presentation server is the target machine on which the data is loaded from the data staging area, organized and stored for direct querying by end users, report writers and other applications.
1. Operational Source
The sources of data for the data warehouse are supplied from :
The data from the mainframe systems in the traditional network and hierarchical format.
Data can also come from the relational DBMS like Oracle, Informix.
In addition to these internal data, operational data also includes external data obtained from commercial
databases and databases associated with supplier and customers.
2. Load Manager
The load manager performs all the operations associated with extraction and loading data into the data
warehouse.
These operations include simple transformations of the data to prepare the data for entry into the warehouse.
The size and complexity of this component will vary between data warehouses and may be constructed using a
combination of vendor data loading tools and custom built programs.
3. Warehouse Manager
The warehouse manager performs all the operations associated with the management of data in the warehouse.
This component is built using vendor data management tools and custom built programs. The operations performed by the warehouse manager include :
o Transformation and merging of the source data from temporary storage into data warehouse tables.
o Denormalization.
o Generation of aggregations.
o Backing up and archiving of data.
o In certain situations, the warehouse manager also generates query profiles to determine which indexes and
aggregations are appropriate.
4. Query Manager
The query manager performs all operations associated with management of user queries.
This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools,
database facilities and custom built programs.
The complexity of a query manager is determined by facilities provided by the end-user access tools and
database.
5. Detailed Data
This area of the warehouse stores all the detailed data in the database schema.
In the majority of cases detailed data is not stored online but aggregated to the next level of details.
The detailed data is added regularly to the warehouse to supplement the aggregated data.
6. Lightly and Highly Summarized Data
This stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to
the changing query profiles.
The main goal of the summarized information is to speed up the query performance.
As the new data is loaded into the warehouse, the summarized data is updated continuously.
7. Archive and Backup Data
The detailed and summarized data are stored for the purpose of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.
8. Meta Data
The data warehouse also stores all the meta data (data about data) definitions used by all the processes in the warehouse.
It is used for a variety of purposes, including :
The extraction and loading process : meta data is used to map data sources to a common view of information.
As part of Query Management process - Meta data is used to direct a query to the most appropriate data source.
The structure of Meta data will differ in each process, because the purpose is different.
9. End User Access Tools
The main purpose of a data warehouse is to provide information to business managers for strategic decision-making.
These users interact with the warehouse using end user access tools.
Some of the examples of end user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Q. From the architectural point of view, explain different data warehouse models. (Oct. 18, 4 Marks)
Fig. 2.8.1 : Data Warehousing Design Strategies
2.8.1 The Top Down Approach : The Dependent Data Mart Structure
The data flow in the top down OLAP environment begins with data extraction from the operational data sources.
This data is loaded into the staging area and validated and consolidated for ensuring a level of accuracy and then
transferred to the Operational Data Store (ODS).
Data is also loaded into the Data warehouse in a parallel process to avoid extracting it from the ODS.
o Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation,
summarization and then extracted and loaded into the Data warehouse.
o The need to have an ODS is determined by the needs of the business.
o If there is a need for detailed data in the Data warehouse then, the existence of an ODS is considered justified.
o Else organizations may do away with the ODS altogether.
o Once the Data warehouse aggregation and summarization processes are complete, the data mart refresh cycles
will extract the data from the Data warehouse into the staging area and perform a new set of transformations
on them.
o This will help organize the data in particular structures required by data marts.
Then the data marts can be loaded with the data and the OLAP environment becomes available to the users.
Advantages of top down approach
The data about the content is centrally stored, and the rules and control are also centralized.
The results are obtained quickly if it is implemented with iterations.
Disadvantages of top down approach
Time consuming process, even with an iterative method.
This architecture makes the data warehouse more of a virtual reality than a physical reality. All data marts could be located on one server or could be located on different servers across the enterprise, while the data warehouse would be a virtual entity, being nothing more than the sum total of all the data marts.
In this context even the cubes constructed by using OLAP tools could be considered as data marts. In both cases the
shared dimensions can be used for the conformed dimensions.
2.8.2 The Bottom Up Approach
The bottom-up approach reverses the positions of the Data warehouse and the Data marts.
Data marts are directly loaded with the data from the operational systems through the staging area.
The ODS may or may not exist depending on the business requirements. However, this approach increases the
complexity of process coordination.
The data flow in the bottom up approach starts with extraction of data from operational databases into the staging
area where it is processed and consolidated and then loaded into the ODS.
The data in the ODS is appended to or replaced by the fresh data being loaded.
After the ODS is refreshed the current data is once again extracted into the staging area and processed to fit into the
Data mart structure.
Fig. 2.8.3 : Bottom Up Approach
The data from the Data mart then is extracted to the staging area aggregated, summarized and so on and loaded into
the Data Warehouse and made available to the end user for analysis.
Advantages of bottom up approach
This model strikes a good balance between centralized and localized flexibility.
Data marts can be delivered more quickly and shared data structures along the bus eliminate the repeated effort
expended when building multiple data marts in a non-architected structure.
Te
The standard procedure where data marts are refreshed from the ODS and not from the operational databases
ensures data consolidation and hence it is generally recommended approach.
Manageable pieces are faster and are easily implemented.
2.8.3 The Hybrid Approach
The Hybrid approach aims to combine the speed and user orientation of the bottom-up approach with the integration of the top-down approach.
The Hybrid approach begins with an Entity Relationship diagram of the data marts and a gradual extension of the data
marts to extend the enterprise model in a consistent, linear fashion.
These data marts are developed using the star schema or dimensional models.
The Extract, Transform and Load (ETL) tool is deployed to extract data from the source into a non persistent staging
area and then into dimensional data marts that contain both atomic and summary data.
The data from the various data marts are then transferred to the data warehouse and query tools are reprogrammed
to request summary data from the marts and atomic data from the Data Warehouse.
Advantages of hybrid approach
A backfilled data warehouse eliminates redundant extracts.
Disadvantages of hybrid approach
Requires organizations to enforce standard use of entities and rules.
Few query tools can dynamically query atomic and summary data in different databases.
2.8.4 Federated Approach
The federated approach focuses on the integration of heterogeneous data warehouses, data marts and packaged applications that already exist in the enterprise.
The goal is to integrate existing analytic structures wherever possible and to define the “highest value” metrics,
dimensions and measures and share and reuse them within existing analytic structures.
This may result in the creation of a common staging area to eliminate redundant data feeds or building of a data
warehouse that sources data from multiple data marts, data warehouses or analytic applications.
Hackney, a vocal proponent of this architecture, claims that it is not an elegant architecture, but it is an architecture that is in keeping with the political and implementation reality of the enterprise.
Provides a rationale for “band aid” approaches that solve real business problems.
Alleviates the guilt and stress data warehousing managers might experience by not adhering to formalized
architectures.
Provides pragmatic way to share data and resources.
With no predefined end-state or architecture in mind, it may give way to unfettered chaos.
It might encourage rampant independent development and perpetuate the disintegration of standards and controls.
1. The first step is to do Planning and defining requirements at the overall corporate level.
4. Consider the series of supermarts one at a time and implement the data warehouse.
5. In this practical approach, first the organization's needs are determined. The key to this approach is that planning is
done first at the enterprise level. The requirements are gathered at the overall level.
6. The architecture is established for the complete warehouse. Then the data content for each supermart is determined.
Supermarts are carefully architected data marts, and they are implemented one at a time.
7. Before implementation, check the data types, field lengths, etc. across the various supermarts, which helps to avoid spreading of different data across several data marts.
8. Finally, a data warehouse is created which is a union of all the data marts. Each data mart belongs to a business process in the enterprise, and the collection of all the data marts forms an enterprise data warehouse.
2.9 A Three-Tier Data Warehousing Architecture
Using application program interfaces (known as gateways), data is extracted from operational and external sources. Gateways such as ODBC (Open Database Connectivity), OLE-DB (Object Linking and Embedding for Databases) and JDBC (Java Database Connectivity) are supported by the underlying DBMS.
OLAP Engine is either implemented using ROLAP (Relational online Analytical Processing) or MOLAP(Multidimensional
OLAP).
This tier is a client which contains query and reporting tools, Analysis tools, and /or data mining tools.
From the architecture point of view, there are three data warehouse models :
(a) Enterprise Warehouse
The information of the entire organization, related to various subjects, is collected in an enterprise warehouse.
(b) Data Mart
A data mart contains a subset of the organization-wide data that is of value to a specific group of users (see Section 2.9.1).
(c) Virtual Warehouse
A virtual warehouse is a set of views over operational databases; only some of the possible summary views may be materialized.
2.9.1 Data Warehouse and Data Marts
A data mart is oriented to a specific purpose or major data subject that may be distributed to support business needs.
It is a subset of the data resource. A data mart is a repository of a business organization's data implemented to answer very specific questions for a specific group of data consumers, such as organizational divisions of marketing, sales, operations, collections and others.
A data mart is typically established as a dimensional model or star schema, which is composed of a fact table and multiple dimension tables. A data mart is a small warehouse designed at the department level. It is often a way to get started with data warehousing on a smaller scale.
Major problem : If they differ from department to department, they can be difficult to integrate enterprise-wide.
Fig. 2.10.1 : Types of OLAP Servers
Approaches to OLAP Servers
In the OLAP world, there are mainly two different types of OLAP servers: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
2.10.1 MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages of MOLAP
1. Excellent performance
MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
2. Can perform complex calculations
All calculations have been pre-generated when the cube is created.
Disadvantages of MOLAP
Fig. 2.10.2 : MOLAP Process
2.10.2 ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
Advantages of ROLAP
1. Can handle large amounts of data
The data size limitation of ROLAP technology is the data size limitation of the underlying relational database. In other words, ROLAP itself places no limitation on the amount of data.
2. Can leverage functionalities inherent in the relational database
Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of
the relational database, can therefore leverage these functionalities.
Disadvantages of ROLAP
Fig. 2.10.3 : ROLAP Process
2.10.3 HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information,
HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can “drill through” from the cube into the underlying relational data.
For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.
2.10.4 DOLAP
It is Desktop Online Analytical Processing, a variation of ROLAP that offers portability to OLAP users. DOLAP requires only the DOLAP software to be present on the machine. Through this software, multidimensional datasets are created and transferred to the desktop machine.
There are four tables : out of these, 3 dimension tables and 1 fact table.
Dimension tables
Fact Table
Operations
1. Slice
Slice the fact table with DID = 2; this cuts the cube at DID = 2 along the time and patient axes, displaying a slice of the cube with time on the x-axis and patient on the y-axis.
2. Dice
It is a sub-cube of the main cube; it cuts the cube with more than one predicate, for example, dice on the cube with DID = 2 together with a condition on another dimension.
3. Roll up
It gives a summary based on concept hierarchies. Assuming there exists a concept hierarchy in the patient table as location → city → state, roll up will summarise the charges or counts in terms of city, and a further roll up will give the charges for a particular state, and so on.
4. Drill down
It is the opposite of roll up : if the cube is currently summarised with respect to city, drill down will show the summarisation with respect to location.
5. Pivot
It rotates the cube, sub cube or rolled-up or drilled-down cube, thus changing the view of the cube.
Ex. 2.11.2 : The AllElectronics company has a sales department; consider three dimensions, namely
(i) Time
(ii) Product
(iii) store
The Schema Contains a central fact table sales with two measures
(i) dollars-cost and
(ii) units-sold
Using the above example describe the following OLAP operations :
(i) Dice (ii) Slice
(iii) Roll-up (iv) drill Down.
Soln. :
There are four tables, out of these 3 dimension tables and 1 fact table.
For OLAP operations refer Example 2.11.1.
Review Questions
3 Measuring Data Similarity and Dissimilarity
Unit III
Syllabus
Measuring Data Similarity and Dissimilarity, Proximity Measures for Nominal Attributes and Binary Attributes,
interval scaled; Dissimilarity of Numeric Data : Minkowski Distance, Euclidean distance and Manhattan distance;
Proximity Measures for Categorical, Ordinal Attributes, Ratio scaled variables; Dissimilarity for Attributes of Mixed Types, Cosine Similarity.
3.1 Measuring Data Similarity and Dissimilarity
Data mining applications such as clustering, classification and outlier analysis need a way to assess how alike or unalike objects are from one another. For this, some measures of similarity and dissimilarity are needed, as given below.
3.1.1 Data Matrix versus Dissimilarity Matrix
− Let us consider a set of n objects with p attributes given by X1=(X11,X12,.... X1p), X2 = (X21, X22,.... X2p) and so on.
− Where Xij is the value for ith object with jth attribute. These objects can be tuples in a relational database or feature
vectors.
− There are mainly two types of data structures for main memory-based clustering algorithms :
Fig. 3.1.1 : Types of data structures for main memory-based clustering algorithms
0
d(2,1)   0
d(3,1)   d(3,2)   0
:        :        :
d(n,1)   d(n,2)   ...   ...   0
− In the above dissimilarity matrix d(i,j) refers to the measure of dissimilarity between objects i and j.
− Similarity : Similarity in the data mining context refers to how much alike two data objects are. It can be described by a distance whose dimensions represent the features of the objects : a small distance indicates that the objects are highly similar, and a large distance indicates that they are not.
− Similarity can also be expressed as, sim(i,j) = 1 – d(i,j).
o Two mode matrix : The data matrix is also called a two-mode matrix, as it represents two kinds of entities : the objects and their attributes (features).
o One mode matrix : The dissimilarity matrix is called a one-mode matrix, as it represents only one kind of entity, i.e. the distances between objects.
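A small sketch showing the two structures side by side, using scipy to turn a data matrix into a dissimilarity matrix; the sample values are arbitrary.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n objects x p attributes (two-mode)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 1.0]])

# Dissimilarity matrix: n x n pairwise Euclidean distances (one-mode)
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))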
3.2 Proximity Measures for Nominal Attributes and Binary Attributes, Interval Scaled
3.2.1 Proximity Measures for Nominal Attributes
Q. How to compute dissimilarity between categorical variables. Explain with suitable example. (Oct. 18, 4 Marks)
− Nominal attributes are also called as Categorical attributes and allow for only qualitative classification.
− Every individual item belongs to one of a set of distinct categories, but quantification or ranking of the order of the categories is not possible.
− The nominal attribute categories can be numbered arbitrarily.
− Proximity refers to either similarity or dissimilarity. As defined in Section 3.1, we can calculate the similarity and dissimilarity of nominal attributes.
Table 3.2.1
Id | Type of Property
1 | Houses
2 | Condos
3 | Co-ops
4 | Bungalows
− The Table 3.2.1 represents nominal data for an estate agent classifying different types of property. The dissimilarity
matrix for the above example can be calculated as follows :
0
1   0
1   1   0
1   1   1   0
− The value in the above matrix is 0 if the objects are similar and it is a 1 if the objects differ.
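More generally, for p nominal attributes the dissimilarity is d(i, j) = (p − m) / p, where m is the number of matching attributes; a minimal sketch :

# Nominal dissimilarity: d(i, j) = (p - m) / p
def nominal_dissimilarity(a, b):
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

# Single-attribute case from Table 3.2.1: 0 if same category, else 1
print(nominal_dissimilarity(["Houses"], ["Condos"]))   # 1.0
print(nominal_dissimilarity(["Houses"], ["Houses"]))   # 0.0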
1. Symmetric binary variable
− A binary variable is symmetric if both of its states are equally valuable and carry the same weight.
For example : Marital status of a person is "Married or Unmarried". In this case both states are equally valuable, and it is difficult to represent them in terms of 0 (absent) and 1 (present).
2. Asymmetric binary variable
− If the outcomes of the states are not equally important, the variable is asymmetric. An example of such a variable is the presence or absence of a relatively rare attribute.
For example : Person is “handicapped or not handicapped”. The most important outcome is usually coded as
1 (present) and the other is coded as 0 (absent). A contingency Table 3.2.2 for binary data :
Table 3.2.2 : Contingency table for binary variables

                 Object j
                 1        0        Sum
Object i   1     a        b        a + b
           0     c        d        c + d
           Sum   a + c    b + d    p
Here, a is the number of attributes that are 1 (present) for both objects, b is the number that are 1 for object i but 0 for object j, c is the number that are 0 for object i but 1 for object j, and d is the number that are 0 for both.
− Simple matching coefficient (invariant, if the binary variable is symmetric), as shown in Equation (3.2.1) :
d(i, j) = (b + c) / (a + b + c + d)    ...(3.2.1)
− Jaccard coefficient (non-invariant, if the binary variable is asymmetric), as shown in Equation (3.2.2) :
d(i, j) = (b + c) / (a + b + c)    ...(3.2.2)
Example
− Distance between Jai and Raj (i.e. d(Jai, Raj)) is calculated using Equation (3.2.2) and use contingency Table 3.2.4.
a = 2
b = 0
c = 1
d(i, j) = (b + c) / (a + b + c)
d(Jai, Raj) = (0 + 1) / (2 + 0 + 1) = 0.33
Similarly, calculate the distance for the other combinations :
d(Jai, Jaya) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jaya, Raj) = (1 + 2) / (1 + 1 + 2) = 0.75
− So, Jai and Raj are most likely to have a similar disease with lowest dissimilarity value.
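A minimal sketch of the Jaccard computation; the binary symptom vectors below are hypothetical and merely chosen to reproduce a = 2, b = 0, c = 1.

# Jaccard dissimilarity for asymmetric binary vectors (1 = present, 0 = absent)
def jaccard_dissimilarity(x, y):
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    return (b + c) / (a + b + c)   # 0-0 matches (d) are ignored

jai = [1, 1, 0, 0]
raj = [1, 1, 1, 0]
print(round(jaccard_dissimilarity(jai, raj), 2))   # 0.33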
Ex. 3.2.1 : Calculate the Jaccard coefficient between Ram and Hari, assuming that all binary attributes are asymmetric and, for each pair of values for an attribute, the first one is more frequent than the second. (Dec. 18, 8 Marks)
Object Gender Food Caste Education Hobby Job
Hari M (1) V (1) M (0) L (1) C (0) N (0)
                 Object j
                 1        0        Sum
Object i   1     a        b        a + b
           0     c        d        c + d
           Sum   a + c    b + d    p
a : It would be the number of variables which are present for both objects.
                 Ram
                 1        0        Sum
Hari       1     1        2        3
           0     1        2        3
           Sum   2        4        6
− Such a similarity measure between two objects described by asymmetric binary attributes is given by the Jaccard coefficient, often symbolized by J :
J = (number of matching presences) / (number of attributes not involved in any 0-0 matching)
J = a / (a + b + c) = 1 / (1 + 2 + 1) = 0.25
Q. What are interval-scaled variables ? Describe the distance measures that are commonly used for computing the
dissimilarity of objects described by such variables. (May 17, 8 Marks)
− Example : weight, height and weather temperature. These attributes allow for ordering, comparing and quantifying
the difference between the values. An interval-scaled attributes has values whose differences are interpretable.
− These measures include the Euclidean, Manhattan, and Minkowski distances.
1. Euclidean distance (L2 norm) :
d_Euc(P, Q) = sqrt( Σ_{i=1..d} |Pi – Qi|^2 )
2. City block / Manhattan distance (L1 norm) :
d_CB(P, Q) = Σ_{i=1..d} |Pi – Qi|
3. Minkowski distance (Lp norm) :
d_Mk(P, Q) = ( Σ_{i=1..d} |Pi – Qi|^p )^(1/p)
− For example, changing measurement units for weight from kilograms to pounds, or for height from meters to inches, may lead to a very different clustering structure. In general, expressing a variable in smaller units leads to a larger range for that variable, and thus a larger effect on the resulting clustering structure. To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is especially helpful when no prior knowledge of the data is available. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others.
− For example, when clustering basketball player candidates, we may favor to give more weight to the variable height.
Minkowski Distance
Minkowski distance formula
d(i, j) = ( |xi1 – xj1|^q + |xi2 – xj2|^q + … + |xip – xjp|^q )^(1/q)
where,
i = (xi1, xi2, …,xip) and j = (xj1, xj2, …, xjp) are two objects with p number of attributes,
q is a positive integer
− Euclidean distance (if q = 2) :
d(i, j) = sqrt( |xi1 – xj1|^2 + |xi2 – xj2|^2 + … + |xip – xjp|^2 )
− Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function :
d(i, j) ≥ 0 (non-negativity)
d(i, i) = 0 (identity)
d(i, j) = d(j, i) (symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
− Supremum / Chebyshev distance (if q = ∞) :
d(i, j) = max over t of |xit – xjt|
Fig. 3.3.1
d2(cust101, cust102) = sqrt((30 – 40)^2 + (1000 – 400)^2 + (20 – 30)^2) ≈ 600.17
dmax(cust101, cust102) = |1000 – 400| = 600
Ex. 3.3.1 : Calculate the Euclidean distance matrix for given Data points. (Dec. 18, 8 Marks)
Point | X | Y
P1 | 0 | 2
P2 | 2 | 0
P3 | 3 | 1
P4 | 5 | 1
Soln. :
d(P1, P2) = sqrt((X2 – X1)^2 + (Y2 – Y1)^2)
          = sqrt((2 – 0)^2 + (0 – 2)^2)
          = sqrt(4 + 4) = sqrt(8)
          = 2.83
d(P1, P3) = sqrt((3 – 0)^2 + (1 – 2)^2)
          = sqrt(9 + 1) = sqrt(10)
          = 3.16
d(P1, P4) = sqrt((5 – 0)^2 + (1 – 2)^2)
          = sqrt(25 + 1) = sqrt(26)
          = 5.09
Similarly calculate the distance for other points.
Points | P1 | P2 | P3 | P4
P1 | 0 | 2.83 | 3.16 | 5.09
P2 | 2.83 | 0 | 1.41 | 3.16
P3 | 3.16 | 1.41 | 0 | 2
P4 | 5.09 | 3.16 | 2 | 0
d(P2, P3) = sqrt((3 – 2)^2 + (1 – 0)^2)
          = sqrt(1 + 1) = sqrt(2)
          = 1.41
d(P2, P4) = sqrt((5 – 2)^2 + (1 – 0)^2)
          = sqrt(9 + 1) = sqrt(10)
          = 3.16
d(P3, P4) = sqrt((5 – 3)^2 + (1 – 1)^2)
          = sqrt(4)
          = 2
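The matrix can be verified with a short scipy sketch :

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Points from Ex. 3.3.1
pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])
D = squareform(pdist(pts, metric="euclidean"))
print(np.round(D, 2))
# [[0.   2.83 3.16 5.09]
#  [2.83 0.   1.41 3.16]
#  [3.16 1.41 0.   2.  ]
#  [5.09 3.16 2.   0.  ]]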
3.4 Proximity Measures for Categorical, Ordinal Attributes, Ratio Scaled Variables
(SPPU - Dec. 18)
Nominal attributes are also called as Categorical attributes. Described in Section 3.2.1
Q. How to compute dissimilarity between ordinal variables. Explain with suitable example. (Oct. 18, 4 Marks)
− A discrete ordinal attribute is a nominal attribute whose different states have a meaningful order or rank.
− The interval between different states is uneven due to which arithmetic operations are not possible, however logical
operations may be applied.
− For example, Considering Age as an ordinal attribute, it can have three different states based on an uneven range of
age value. Similarly income can also be considered as an ordinal attribute, which is categorised as low, medium, high
based on the income value.
− An ordinal attribute can be discrete or continuous. The ordering of it is important e.g. a rank. These attributes can be
treated like interval scaled variables.
− Let us consider f as an ordinal attribute having Mf states. These ordered states define the ranking :
rif ∈ {1, …, Mf}
− Map the range of each variable onto [0, 1] by replacing the rank rif of the ith object in the fth variable by :
zif = (rif – 1) / (Mf – 1)
Emp Id | Income
1 | High
2 | Low
3 | Medium
4 | High
− The three states for the above income variable are low, medium and high, that is Mf = 3.
− Next we can replace these values by ranks 3(low), 2(medium) and 1(High).
− We can now normalise the ranking by mapping rank 1 to 0.0, rank 2 to 0.5 and rank 3 to 1.0.
− Next to calculate the distance we can use the Euclidean distance that results in a dissimilarity matrix as :
bl kn
0
at
Pu ch
1.0 0
Te
0.5 0.5 0
0 1.0 0.5 0
− From the above matrix it can be seen that objects 1 and 2 are the most dissimilar, as are objects 2 and 4.
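A minimal sketch of the rank-normalization step and the resulting distances; the rank assignment follows the example above.

# Ordinal attribute: map ranks onto [0, 1] via z = (r - 1) / (M - 1)
ranks = {"High": 1, "Medium": 2, "Low": 3}
M = len(ranks)

income = ["High", "Low", "Medium", "High"]          # objects 1..4
z = [(ranks[v] - 1) / (M - 1) for v in income]      # [0.0, 1.0, 0.5, 0.0]

d_12 = abs(z[0] - z[1])   # dissimilarity between objects 1 and 2
d_24 = abs(z[1] - z[3])   # dissimilarity between objects 2 and 4
print(z, d_12, d_24)      # ... 1.0 1.0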
− Ratio scaled attributes are continuous positive measurements on a non linear scale. They are also interval scaled data
but are not measured on a linear scale.
− For such interval-scaled measurements, operations like addition and subtraction are meaningful, but multiplication and division (ratios) are not.
− For example : If a liquid is at 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0 degrees does not represent "no temperature".
− If an attribute can take any value between two specified values, it is called continuous; otherwise it is discrete. An attribute may be continuous on one scale and discrete on another.
− For example : If we try to measure the amount of water consumed by counting the individual water molecules, it will be discrete; otherwise it will be continuous.
1. Examples of continuous attributes include time spent waiting, direction of travel, water consumed, etc.
2. Examples of discrete attributes include the voltage output of a digital device and a person's age in years.
3.5 Dissimilarity for Attributes of Mixed Types
− In such cases, one of the most preferred approaches is to combine all the attributes into a single dissimilarity matrix, computed on a common scale of [0.0, 1.0].
− The dissimilarity between objects i and j over p attributes of mixed types may be calculated as :
d(i, j) = ( Σ_{f=1..p} δij(f) dij(f) ) / ( Σ_{f=1..p} δij(f) )
where the indicator δij(f) = 0 if either xif or xjf is missing, or if xif = xjf = 0 and attribute f is asymmetric binary; otherwise δij(f) = 1. The contribution dij(f) of attribute f is computed according to its type (nominal, ordinal or numeric).
3.6 Cosine Similarity
− Cosine similarity is a measure of similarity between two vectors. The data objects are treated as vectors, and similarity is measured as the cosine of the angle θ between them : the similarity is 1 when θ = 0° and 0 when θ = 90°.
cos(i, j) = (i • j) / (||i|| × ||j||),  where  i • j = Σ_{k=1..n} ik jk  and  ||i|| = sqrt( Σ_{k=1..n} ik^2 )
Then, the similarity between x and y : cos(x, y) = 3 / (6.16 × 1) = 0.49
The dissimilarity between x and y : 1 – cos(x, y) = 0.51
Ex. 3.6.1 : Consider the following vectors x and y : x = [1, 1, 1, 1], y = [2, 2, 2, 2]. Calculate the cosine similarity and the Euclidean distance between x and y.
Soln. :
− Cosine similarity is a measure of similarity between two vectors. The data objects are treated as vectors, and similarity is measured as the cosine of the angle θ between them : the similarity is 1 when θ = 0° and 0 when θ = 90°.
cos(i, j) = (i • j) / (||i|| × ||j||),  where  i • j = Σ_{k=1..n} ik jk  and  ||i|| = sqrt( Σ_{k=1..n} ik^2 )
Here, x • y = (1 × 2) + (1 × 2) + (1 × 2) + (1 × 2) = 8, ||x|| = sqrt(4) = 2 and ||y|| = sqrt(16) = 4, so
cos(x, y) = 8 / (2 × 4) = 1
To find the Euclidean distance between two points or tuples, the formula is given below.
Let Y1 = {y11, y12, y13, …, y1n} and Y2 = {y21, y22, y23, …, y2n}
distance(Y1, Y2) = sqrt( Σ_{i=1..n} (y1i – y2i)^2 )
Here, x = {1, 1, 1, 1} and y = {2, 2, 2, 2}
distance(x, y) = sqrt( Σ_{i=1..4} (xi – yi)^2 )
             = sqrt((2 – 1)^2 + (2 – 1)^2 + (2 – 1)^2 + (2 – 1)^2)
             = sqrt(4)
             = 2
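Both quantities can be checked with a short numpy sketch :

import numpy as np

x = np.array([1, 1, 1, 1])
y = np.array([2, 2, 2, 2])

cosine = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
euclid = np.linalg.norm(x - y)
print(round(cosine, 2), round(euclid, 2))   # 1.0 2.0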
Review Questions
Q. 1 Explain different types of data structures for main memory-based clustering algorithms.
4 Association Rules Mining
Unit IV
Syllabus
Market basket Analysis, Frequent item set, Closed item set, Association Rules, a-priori Algorithm, Generating
Association Rules from Frequent Item sets, Improving the Efficiency of a-priori, Mining Frequent Item sets without
Candidate Generation : FP Growth Algorithm; Mining Various Kinds of Association Rules : Mining multilevel association rules, constraint based association rule mining, Meta rule-Guided Mining of Association Rules.
4.1 Market Basket Analysis
4.1.1 What is Market Basket Analysis?
− Market basket analysis is a modelling technique, also called affinity analysis, which helps identify which items are likely to be purchased together.
− The market-basket problem assumes we have some large number of items, e.g., "bread", "milk", etc. Customers buy subsets of items as per their needs, and the marketer gets information about which items customers have taken together. The marketers use this information to decide where to place items.
− For example : Someone who buys a packet of milk also tends to buy bread at the same time.
− Market basket analysis is used in deciding the location of items inside a store, for e.g. if a customer buys a packet of
bread he is more likely to buy a packet of butter too, keeping the bread and butter next to each other in a store would
result in customers getting tempted to buy one item with the other.
− The problem of a large volume of trivial results can be overcome with the help of differential market basket analysis, which enables finding interesting results and eliminates the large volume of trivial rules.
− Using differential analysis it is possible to compare results between various stores, between customers in various
demographic groups.
− Some special observations among the rules for e.g. if there is a rule which holds in one store but not in any other (or
vice versa) then it may be really interesting to note that there is something special about that store in the way it has
organized its items inside the store may be in a more lucrative way. These types of insights will improve company
sales.
− Identification of sets of items purchases or events occurring in a sequence, something that may be of interest to direct
marketers, criminologists and many others, this approach may be termed as Predictive market basket analysis.
− For a bank :
o Analysis of cheque payments made.
o Analysis of services/products taken, e.g. a customer who has taken an executive credit card is also likely to take a personal loan of $5,000 or less.
− For a telecom operator :
o Special combo offers may be offered to the customers on the products sold together.
o Placement of items nearby inside a store, which may result in customers getting tempted to buy one product along with the other.
Q. Explain the following terms : Closed and maximal frequent itemsets. (May 16, Dec. 16, Aug. 17, 3 Marks)
− An itemset is closed if none of its immediate supersets has the same support as the itemset.
− Consider two itemsets X and Y. If every item of X is in Y and there is at least one item of Y that is not in X, then Y is a proper super-itemset of X. An itemset X is closed if no proper super-itemset of X has the same support as X.
Example
Let us consider minimum support = 2.
− The itemsets that are circled with thick lines are the frequent itemsets as they satisfy the minimum support. Fig. 4.3.1,
Frequent itemsets are { p,q,r,s,pq,ps,qs,rs,pqs}.
− The itemsets that are circled with double lines are closed frequent itemsets. In Fig. 4.3.1, the closed frequent itemsets are {p, r, rs, pqs}. For example, {rs} is a closed frequent itemset, as all of its supersets {prs, qrs} have support less than 2.
− The itemsets that are circled with double lines and shaded are maximal frequent itemsets. In Fig. 4.3.1, the maximal frequent itemsets are {rs, pqs}. For example, {rs} is a maximal frequent itemset, as none of its immediate supersets, such as {prs} and {qrs}, is frequent.
Fig. 4.3.1 : Lattice diagram for maximal, closed and frequent itemsets
− The items or objects in Relational databases, transactional databases or other information repositories are considered
for finding frequent patterns, associations, correlations, or causal structures.
− Association rule mining searches for interesting relationships among items in a given data set. By examining transactions, or shopping carts, we can find which items are commonly purchased together. This knowledge can be used in advertising or in goods placement in stores.
− An association rule has the form I1 ⇒ I2, meaning that a customer who buys the items in the set I1 is also likely to buy the items in the set I2.
4.4.1 Finding the Large Itemsets
− Calculate the support and confidence for each rule generated in the above step.
− The rules that fail the minsup and minconf thresholds are pruned from the above list.
− The above steps would be a time consuming process; a better approach is given below.
Frequent pattern mining can be classified in various ways, based on the following criteria :
1. Completeness of the pattern to be mined : Here we can mine the complete set of frequent itemset, closed frequent
itemset, constrained frequent itemsets.
2. Levels of abstraction involved in the rule set : Here we use multilevel association rules based on the levels of
abstraction of data.
3. Number of data dimensions involved in the rule : Here we use single dimensional association rule, there is only one
dimension or multidimensional association rule if there is more than one dimension.
4. Types of the values handled in the rule : Here we use Boolean and quantitative association rules.
5. Kinds of the rules to be mined : Here we use association rules and correlation rules based on the kinds of the rules to
be mined.
6. Kinds of pattern to be mined : Here we use frequent itemset mining, sequential pattern mining and structured
pattern mining.
Q. Explain the Apriori algorithm for generation of association rules. How candidate keys are generated using apriori
algorithm. (Aug. 17, 6 Marks.)
Apriori Algorithm for Finding Frequent Itemsets using Candidate Generation
− The Apriori Algorithm solves the frequent item sets problem.
− The algorithm analyzes a data set to determine which combinations of items occur together frequently.
ic ow
− The Apriori algorithm is at the core of various algorithms for data mining problems. The best known problem is finding
the association rules that hold in a basket - item relation.
Basic idea
− An itemset can only be a large itemset if all its subsets are large itemsets.
− Apriori property : all non-empty subsets of a frequent itemset must also be frequent.
− Find frequent itemsets iteratively, with cardinality from 1 to k (k-itemsets).
Q. Write a pseudo code for Apriori algorithm and explain. (Dec. 15, 6 Marks)
Q. Write Apriori Algorithm and explain it with suitable example. (Dec.17, 6 Marks)
Input :
D : a database of transactions;
min_sup : the minimum support count threshold.
Output : L, frequent itemsets in D.
Method :
(1) L1 = find_frequent_1-itemsets(D);
(2) for (k = 2; Lk–1 ≠ φ; k++) {
(3)    Ck = apriori_gen(Lk–1);
(4)    for each transaction t ∈ D { // scan D for counts
(5)       Ct = subset(Ck, t); // get the subsets of t that are candidates
(6)       for each candidate c ∈ Ct
(7)          c.count++;
(8)    }
(9)    Lk = {c ∈ Ck | c.count ≥ min_sup}
(10) }
(11) return L = ∪k Lk;

procedure apriori_gen(Lk–1 : frequent (k – 1)-itemsets)
(1) for each itemset l1 ∈ Lk–1
(2)    for each itemset l2 ∈ Lk–1
(3)       if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k – 2] = l2[k – 2]) ∧ (l1[k – 1] < l2[k – 1]) then {
(4)          c = l1 ⋈ l2; // join step : generate candidates
(5)          if has_infrequent_subset(c, Lk–1) then
(6)             delete c; // prune step : remove unfruitful candidate
(7)          else add c to Ck;
(8)       }
(9) return Ck;
Advantages
Disadvantages
1. Although the algorithm is easy to implement it needs many database scans which reduces the overall performance.
2. Due to Database scans, the algorithm assumes transaction database is memory resident.
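A minimal Python sketch of the candidate-generate-and-test idea (not the book's exact pseudocode); the function and variable names are ours, and the transactions are taken from Ex. 4.8.1 below.

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns {frozenset: support_count} for all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                     # L1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 2
    while Lk:
        prev = list(Lk)
        # Join step: combine frequent (k-1)-itemsets, keep only size-k candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan D for counts
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# Transactions from Ex. 4.8.1; min support = 50% of 4 transactions = 2
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, sup in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)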
− Using the confidence formula, generate strong association rules which satisfy both minimum support and minimum confidence.
Support
− The support of an itemset is the count of that itemset relative to the total number of transactions; in other words, it is the percentage of the transactions in which the items appear.
If A ⇒ B, then
Support (A ⇒ B) = (# tuples containing both A and B) / (total # of tuples)
− The support (s) for an association rule X ⇒ Y is the percentage of transactions in the database that contain X as well as Y, i.e. X and Y together.
− An itemset is considered to be a large itemset if its support is above some threshold called minimum support.
Confidence
− The confidence or strength for an association rule A => B is the ratio of the number of transactions that contain A as
well as B to the number of transactions that contain A.
− Consider a rule A ⇒ B; it is a measure of the ratio of the number of tuples containing both A and B to the number of tuples containing A :
Confidence (A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
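A minimal sketch computing support and confidence for one rule directly from the transactions of Ex. 4.8.1; the helper functions are illustrative, not a library API.

# Support and confidence of a candidate rule from raw transactions
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# Rule {2, 3} => {5}
print(support({2, 3} | {5}))          # 0.5  (2 of 4 transactions)
print(confidence({2, 3}, {5}))        # 1.0  (every transaction with {2,3} also has 5)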
Q. How can we improve the efficiency of apriori algorithm. (Dec. 18, 4 Marks)
There are many variations of the Apriori algorithm that have been proposed to improve its efficiency; a few of them are given below :
− Hash-based itemset counting : The itemsets can be hashed into corresponding buckets. For a particular iteration a k-
itemset can be generated and hashed into their respective bucket and increase the bucket count, the bucket with a
count lesser than the support should not be considered as a candidate set.
− Transaction reduction : A transaction that does not contain k-frequent itemset will never have k + 1 frequent itemset,
such a transaction should be reduced from future scans.
− Partitioning : In this technique only two database scans are needed to mine the frequent itemsets. The algorithm has two phases. In the first phase, the transaction database is divided into non-overlapping partitions. The minimum support count of a partition is (minimum support) × (number of transactions in that partition). Local frequent itemsets are found in each partition.
− The local frequent itemsets may or may not be frequent with respect to the entire database however a frequent
itemset from database has to be frequent in atleast one of the partitions.
− All the frequent itemsets with respect to each partition forms the global candidate itemsets. In the second phase of
the algorithm, a second scan of database for actual support of each item is found, these are global frequent itemsets.
− Sampling : Rather than finding the frequent itemsets in the entire database D, a subset of transactions are picked up
and searched for frequent itemsets. A lower threshold of minimum support is considered as this reduces the
possibility of missing the actual frequent itemset due to a higher support count.
− Dynamic itemset counting : In this the database is partitioned into blocks and is marked by start points. It maintains a
count-so-far, if this count-so-far crosses minimum support, the itemset is added to the frequent itemset collection
e
which can be further used to generate longer candidate itemset.
g
4.8 Solved Example on Apriori Algorithm
io eld
Ex. 4.8.1 : Given the following data, apply the Apriori algorithm. Min support = 50 % Database D.
TID Items
ic ow
100 134
n
200 235
bl kn
300 1235
at
Pu ch
400 25
Soln. :
Te
Step 1 : Scan D for count of each candidate. The candidate list is {1, 2, 3,4, 5} and find the support.
C1 =
Itemset Sup.
{1} 2
{2} 3
{3} 3
{4} 1
{5} 3
Step 2 : Compare candidate support count with minimum support count (i.e. 50%)
L1 =
Itemset Sup.
{1} 2
{2} 3
{3} 3
{5} 3
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-9 Association Rules Mining
Itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}
e
Step 4 : Scan D for count of each candidate in C2 and find the support
g
io eld C2 =
Itemset Sup.
{1 2} 1
ic ow
{1 3} 2
n
{1 5} 1
bl kn
{2 3} 2
{2 5} 3
at
Pu ch
{3 5} 2
Te
Step 5 : Compare candidate (C2) support count with the minimum support count
L2=
Itemset Sup.
{1 3} 2
{2 3} 2
{2 5} 3
{3 5} 2
Itemset
{1,3,5}
{2,3,5}
{1,2,3}
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-10 Association Rules Mining
Itemset sup
{1,3,5} 1
{2,3,5} 2
{1,2,3} 1
Step 8 : Compare candidate (C3) support count with the minimum support count
L3 =
Itemset sup
{2,3,5} 2
e
Step 9 : So data contain the frequent itemset(2,3,5)
g
Therefore the association rule that can be generated from L3 are as shown below with the support and
io eld
confidence.
Association Rule Support Confidence Confidence %
ic ow
If the minimum confidence threshold is 70% (Given), then only the first and second rules above are output, since these
are the only ones generated that are strong.
Ex. 4.8.2 : Find the frequent item sets in the following database of nine transactions, with a minimum support 50% and
confidence 50%.
Transaction ID Items Bought
2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F
Soln. :
Step 1 : Scan D for count of each candidate. The candidate list is {A,B,C,D,E,F} and find the support
C1=
Items Sup.
{A} 3
{B} 2
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-11 Association Rules Mining
Items Sup.
{C} 2
{D} 1
{E} 1
{F} 1
Step 2 : Compare candidate support count with minimum support count (50%)
L1=
Items Sup.
{A} 3
{B} 2
{C} 2
g e
Step 3 : Generate candidate C2 from L1
io eldC2=
Items
{A,B}
ic ow
{A,C}
n
{B,C}
bl kn
Step 4 : Scan D for count of each candidate in C2 and find the support
at
C2=
Pu ch
Items Sup.
{A,B} 1
Te
{A,C} 2
{B,C} 1
Step 5 : Compare candidate (C2) support count with the minimum support count
L2 =
Items Sup.
{A,C} 2
Minimum confidence threshold is 50% (Given), then both the rules are output as the confidence is above 50 %.
So final rules are :
Rule 1 : A - > C
Rule 2 : C- > A
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-12 Association Rules Mining
Ex. 4.8.3 : Consider the transaction database given below. Use Apriori algorithm with minimum support count 2. Generate
the association rules along with its confidence.
T200 I2, I4
T300 I2, I3
T500 I1, I3
T600 I2, I3
T700 I1, I3
e
T800 I1, I2, I3, I5
g
T900 I1, I2, I3
io eld
Soln. :
Step 1 : Scan the transaction Database D and find the count for item-1 set which is the candidate. The candidate list is
ic ow
{I1, I2, I3, I4, I5} and find each candidates support.
C1=
n
1-Itemsets Sup-count
bl kn
I1 6
I2 7
at
Pu ch
I3 6
I4 2
Te
I5 2
Step 2 : Find out whether each candidate item is present in at least two transactions (As support count given is 2).
L1=
1-Itemsets Sup-count
1 6
2 7
3 6
4 2
5 2
Step 3 : Generate candidate C2 from L1 and find the support of 2-itemsets.
C2=
2-Itemsets Sup-count
1,2 4
1,3 4
1,4 1
1,5 2
2,3 4
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-13 Association Rules Mining
2-Itemsets Sup-count
2,4 2
2,5 2
3,4 0
3,5 1
4,5 0
Step 4 : Compare candidate (C2) generated in step 3 with the support count, and prune those itemsets which do not
satisfy the minimum support count.
L2 =
Frequent Sup-count
2-Itemsets
1,2 4
e
1,3 4
g
1,5 2
io eld
2,3 4
2,4 2
ic ow
2,5 2
Step 5 : Generate candidate C3 from L2 .
n
bl kn
C3 =
Frequent 3-Itemset
at
Pu ch
1,2,3
1,2,5
Te
1,2,4
Step 6 : Scan D for count of each candidate in C3 and find their support count.
C3 =
Step 7 : Compare candidate (C3) support count with the minimum support count and prune those itemsets which do
not satisfy the minimum support count.
L3 =
Let us consider the frequent itemsets ={I1, I2, I5}. Following are the Association rules that can be generated
shown below with the support and confidence.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-14 Association Rules Mining
Suppose if the minimum confidence threshold is 75% then only the following rules will be considered as output, as
they are strong rules.
Rules Confidence
e
I1^I5=>I2 100%
g
io eld I2^I5=>I1 100%
I5=>I1^I2 100%
TID Items
01 1, 3, 4, 6
n
bl kn
02 2,3,5,7
03 1,2,3,5,8
at
04 2,5,9,10
Pu ch
05 1,4
Te
Apply the Apriori with minimum support of 30% and minimum confidence of 75% and find large item set L.
Soln. :
Step 1 : Scan the transaction Database D and find the count for item-1 set which is the candidate. The candidate list is
{1,2,3,4,5,6,7,8,9,10} and find the support.
C1 =
Itemset Sup-count
1 3
2 3
3 3
4 2
5 3
6 1
7 1
8 1
9 1
10 1
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-15 Association Rules Mining
Step 2 : Find out whether each candidate item is present in at least 30% of transactions (As support count given is 30%).
L1 =
Itemset Sup-count
1 3
2 3
3 3
4 2
5 3
e
1,4 2
g
1,5 1
io eld
2,3 2
2,4 0
2,5 3
ic ow
3,4 1
3,5 2
n
bl kn
Step 4 : Compare candidate (C2) generated in step 3 with the support count, and prune those itemsets which do not
satisfy the minimum support count.
at
Pu ch
L2 =
Itemset Sup-count
Te
1,3 2
1,4 2
2,3 2
2,5 3
Itemset Sup-count
2,3,5 2
Following are the association rules that can be generated from L3 are as shown below with the support and
confidence.
Association Rule Support Confidence Confidence %
2^3=>5 2 2/2=1 100%
3^5=>2 2 2/2=1 100%
2^5=>3 2 2/3=0.66 66%
2=>3^5 2 2/3=0.66 66%
3=>2^5 2 2/3=0.66 66%
5=>2^3 2 2/3=0.66 66%
Given minimum confidence threshold is 75%, so only the first and second rules above are output, since these are the
only ones generated that are strong.
e
Final Rules are :
g
Rule 1: 2^3=>5 and Rule 2 : 3^5=>2
io eld
Ex. 4.8.5 : A database has four transactions. Let min sup=60% and min conf= 80%
Step 2 : Compare candidate support count with minimum support count (i.e. 60%).
L1 =
Itemset Sup-count
A 4
B 4
D 3
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-17 Association Rules Mining
Itemset
A,B
A,D
B,D
Step 4 : Scan D for count of each candidate in C2 and find the support.
C2 =
Itemset Sup-count
A,B 4
A,D 3
e
B,D 3
g
Step 5 : Compare candidate (C2) support count with the minimum support count.
io eld
L2 =
Itemset Sup-count
ic ow
A,B 4
A,D 3
n
B,D 3
bl kn
C3 =
Itemset
A,B,D
Te
Step 8 : Compare candidate (C3) support count with the minimum support count.
L3 =
Itemset Sup
A,B,D 3
If the minimum confidence threshold is 80% (Given), then only the SECOND, THIRD AND LAST rules above are output,
since these are the only ones generated that are strong.
Ex. 4.8.6 : Apply the Apriori algorithm on the following data with Minimum support = 2
T100 I1,I2,I4
T200 I1,I2,I5
e
T300 I1,I3,I5
g
T400 I2,I4
io eld
T500 I2,I3
T600 I1,I2,I3,I5
ic ow
T700 I1,I3
n
T800 I1,I2,I3
bl kn
T900 I2,I3
T1000 I3,I5
at
Pu ch
Soln. :
Step 1 : Scan Dfor count of each candidate. The candidate list is {I1, I2, I3, I4, I5} and find the support.
Te
C1 =
I-Itemsets Sup-count
I1 6
I2 7
I3 7
I4 2
I5 4
Step 2 : Compare candidate support count with minimum support count (i.e. 2).
L1 =
I-Itemsets Sup-count
1 6
2 7
3 6
4 2
5 2
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-19 Association Rules Mining
e
4,5 0
g
Step 4 : Compare candidate (C2) support count with the minimum support count.
io eld
L2 =
2-Itemsets Sup-count
ic ow
1,2 4
1,3 4
n
1,5 3
bl kn
2,3 4
at
2,4 2
Pu ch
2,5 2
3,5 3
Te
Step 7 : Compare candidate (C3) support count with the minimum support count.
L3 =
Frequent 3-Itemset Sup-count
1,2,3 2
1,2,5 2
1,3,5 2
Step 8 : So data contain the frequent itemsets are {I1,I2,I3} and {I1,I2,I5} and {I1,I3,I5}.
Let us assume that the data contains the frequent itemset = {I1,I2,I5} then the association rules that can be
generated from frequent itemset are as shown below with the support and confidence.
e
I1^I5=>I2 2 2/2 100%
g
I2^I5=>I1 2 2/2 100%
io eld
I1=>I2^I5 2 2/6 33%
I2=>I1^I5 2 2/7 29%
ic ow
since these are the only ones generated that are strong.
Similarly do for frequent itemset {I1,I2,I3} and {I1,I3,I5}.
at
Pu ch
Ex. 4.8.7 : A Database has four transactions. Let Minimum support and confidence be 50%.
Te
Tid Items
100 1, 3, 4
200 2, 3, 5
300 1, 2, 3, 5
400 2, 5
500 1,2,3
600 3,5
700 1,2,3,5
800 1,5
900 1,3
Soln. :
Step 1 : Scan D for count of each candidate. The candidate list is {1,2,3,4,5}and find the support.
C1 =
Itemset Sup-count
1 6
2 5
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-21 Association Rules Mining
Itemset Sup-count
3 7
4 1
5 6
Step 2 : Compare candidate support count with minimum support count (i.e. 50%).
L1 =
Itemset Sup-count
1 6
2 5
3 7
e
5 6
g
Step 3 : Generate candidate C2 from L1and find the support.
io eld
C2 =
Itemset Sup-count
ic ow
1,2 3
n
1,3 5
bl kn
1,5 3
at
Pu ch
2,3 4
2,5 4
Te
3,5 4
Step 4 : Compare candidate (C2) support count with the minimum support count.
L2 =
Itemset Sup-count
1,3 5
Given minimum confidence threshold is 50% , so both the rules are strong.
Final rules are :
Rule 1: 1=>3 and Rule 2 : 3=>1
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-22 Association Rules Mining
Ex. 4.8.8 : Consider the five transactions given below. If minimum support is 30% and minimum confidence is 80%,
determine the frequent itemsets and association rules using the a priori algorithm.
Transaction items
T1 Bread, Jelly, Butter
T2 Bread, Butter
T3 Bread, Milk, Butter
T4 Coke, Bread
T5 Coke, Milk
Soln. :
Step 1 : Scan D for Count of each candidate.
The candidate list is {Bread, Jelly, Butter, Milk, Coke}
e
C1 =
g
I-Itemlist Sup-Count
io eld
Bread 4
Jelly 1
Butter 3
ic ow
Milk F2
n
Coke 2
bl kn
Step 2 : Compare candidate support count with minimum support count (i.e. 2)
at
Pu ch
I-Itemlist Sup-Count
Bread 4
Te
Butter 3
Milk 2
Coke 2
Step 4 : Compare candidate (C2) support count with the minimum support count
L2 =
Frequent 2 - Itemset Sup – Count
{Bread, Butter} 3
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-23 Association Rules Mining
TID Items
01 A, B, C, D
02 A, B, C, D, E, G
e
03 A, C, G, H, K
g 04 B, C, D, E, K
io eld
05 D, E, F, H, L
06 A, B, C, D, L
ic ow
07 B, I, E, K, L
n
08 A, B, D, E, K
bl kn
09 A, E, F, H, L
10 B, C, D, F
at
Pu ch
Apply the Apriori algorithm with minimum support of 30% and minimum confidence of 70%, and find all the
association rules in the data set.
Te
Soln. :
Step 1 : Generate single item set :
e
BC 5 EF 2 CD 5
BD
g 6 EH 2 DE 4
io eld
BE 4 EK 3 EK 3
BF 1 EL 3 EL 3
ic ow
BH 0 FH 2
n
BK 3 FK 0
bl kn
BL 2 FL 2
CD 5 HK 1
at
Pu ch
CE 2 HL 2
CF 1 KL 1
Te
e
ACD 3
g
BCD 5
io eld
BDE 3
ABDE 2
BCDE 2
at
Pu ch
Therefore ABCD is the large item set with minimum support 30%.
Te
From the above Rules generated, only the rules having greater than 70% are considered as final rules. So final Rules
are,
AB → CD
AC → BD
AD → BC
ACD → B
ABD → C
ABC → D
Transaction Items
e
t1 Bread, Jelly, Peanut Butter
g
io eld t2 Bread, Peanut Butter
t3 Bread, Milk, Peanut Butter
t4 Beer, Bread
t5 Beer, Milk
ic ow
Calculate the support and confidence for the following association rules :
n
i) Bread → Peanut Butter
bl kn
Soln. :
Consider Minimum support count = 2 and Minimum confidence = 80%
Te
Step 2 : Compare candidate support count with minimum support count (i.e. 2)
I-Itemlist Sup-Count
Bread 4
Peanut Butter 3
Milk 2
Beer 2
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-27 Association Rules Mining
Step 4 : Compare candidate (C2) support count with the minimum support count
e
L2 =
g
io eld Frequent 2 - Itemset Sup – Count
Step 1 : Scan D for count of each candidate. The candidate list is {A,B,C,D,F,M} and find the support.
C1 =
Itemset Sup-count
A 4
B 4
C 2
D 3
F 2
M 1
Step 2 : Compare candidate support count with minimum support count (i.e. 50%).
L1 =
Itemset Sup-count
A 4
e
B 4
g
C 2
io eld
D 3
F 2
ic ow
A,B
at
A,C
Pu ch
A,D
A,F
Te
B,C
B,D
B,F
C,D
C,F
D,F
Step 4 : Scan D for count of each candidate in C2 and find the frequent itemset.
L2 =
Itemset Sup-count
A,B 4
A,C 2
A,D 3
A,F 2
B,C 2
B,D 2
B,F 2
C,F 2
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-29 Association Rules Mining
Itemset
A,B,C
A,B,D
A,B,F
A,D,F
B,C,D
B,C,F
B,D,F
g e
io eld A,C,F
Step 6 : Compare candidate (C3) support count with the minimum support count.
L3 =
ic ow
Itemset Sup-count
n
A,B,C 2
bl kn
A,B,D 3
at
Pu ch
A,B,F 2
B,C,F 2
Te
A,C,F 2
Itemset
A,B,C,D
A,B,C,F
Step 8 : Compare candidate (C4) support count with the minimum support count.
L4 =
Itemset Sup-count
A,B,C,D 1
A,B,C,F 2
Ex. 4.8.12 : A database has five transactions. Let minimum support is 60%.
TID Items
1 Butter, Milk
e
3 Milk, Dates, Balloon, Cake
g
4 Butter, Milk, Dates, Balloon
io eld
5 Butter, Milk, Dates, Cake
Find all the frequent item sets using Apriori algorithm. Show each step. (SPPU - Oct. 16, 6 Marks)
ic ow
Soln. :
Step 1 : Scan database for count of each candidate. The candidate list is {Butter, milk, Dates, Balloon, Eggs, cake} and
n
bl kn
Itemset Support
{ Butter } 4
Te
{ Milk } 4
{ Dates } 4
{ Balloon } 3
{ Eggs } 1
{ Cake } 2
Step 2 : Compare candidate support count with minimum support (i.e. 60%)
L1 =
Itemset Support
{ Butter } 4
{ Milk } 4
{ Dates } 4
{ Balloon } 3
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-31 Association Rules Mining
Itemset
{ Butter, Milk }
{ Butter, Dates}
{Butter, Balloon }
{Milk, Dates}
{ Milk, Balloon }
{Dates, Balloon }
e
Step 4 : Scan D for count of each candidate to find the support C2.
g
io eld Itemset Support
{ Butter, Milk } 3
{ Butter, Dates} 3
ic ow
{Butter, Balloon } 2
n
{Milk, Dates} 3
bl kn
{ Milk, Balloon } 2
{Dates, Balloon } 3
at
Pu ch
Itemset Support
{ Butter, Milk } 3
{ Butter, Dates} 3
{Milk, Dates} 3
{ Dates, Balloon } 3
e
Itemset Sup-count
g
A 4
io eld
B 4
C 2
D 3
ic ow
F 2
n
M 1
bl kn
Step 2 : Compare candidate support count with minimum support count (i.e. 50%).
at
L1 =
Pu ch
Itemset Sup-count
A 4
Te
B 4
C 2
D 3
F 2
Step 4 : Scan D for count of each candidate in C2 and find the frequent itemset.
L2 =
Itemset Sup-count
A,B 4
A,C 2
A,D 3
A,F 2
B,C 2
B,D 2
B,F 2
C,F 2
e
Step 5 : Generate candidate C3 from L2.
g
C3 =
io eld
Itemset
A,B,C
ic ow
A,B,D
A,B,F
n
A,D,F
bl kn
B,C,D
at
B,C,F
Pu ch
B,D,F
A,C,F
Te
Step 6 : Compare candidate (C3) support count with the minimum support count.
L3 =
Itemset Sup-count
A,B,C 2
A,B,D 3
A,B,F 2
B,C,F 2
A,C,F 2
Itemset
A,B,C,D
A,B,C,F
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-34 Association Rules Mining
Step 8 : Compare candidate (C4) support count with the minimum support count.
L4 =
Itemset Sup-count
A,B,C,D 1
A,B,C,F 2
e
B,C,F A 2 2/2 100
g
A,B,F C
io eld 2 2/2 100
4.9 Mining Frequent Item sets without Candidate Generation : FP Growth Algorithm
Definition of FP-tree
ic ow
− A set of item prefix sub-trees with each node formed by three fields : item-name, count, node-link.
at
− A frequent-item header table with two fields for each entry : item-name, head of node-link.
Pu ch
− The size of the FP-tree is bounded by the size of the database, but due to frequent items sharing, the size of the tree is
usually much smaller than its original database.
− High compaction is achieved by placing more frequently items closer to the root (being thus more likely to be shared).
− The FP-Tree contains everything from the database we need to know for mining frequent.
Patterns
− The size of the FP-tree is ≤ the candidate sets generated in the association rule mining.
− This approach is very efficient due to :
o Compression of a large database into a smaller data structure.
o It is a frequent pattern growth mining method or simply FP-growth.
o It adopts a divide-and-conquer strategy.
− The database of frequent items is compressed into a FP-Tree, and the association information of items is preserved.
− Then mine each such database separately.
− Algorithm : FP growth, Mine frequent itemsets using an FP-tree by pattern fragment growth.
Input
− D, a transaction database.
− min_sup, the minimum support count threshold.
Method
e
1. A FP tree is constructed in the following steps
g
(a) Scan the transaction database D once, Collect F, the set of frequent items, and their support counts. Sort F by support
io eld
count in descending order as L, the list of frequent items.
(b) Create the root of an FP tree, and label it as “null”. For each transaction Trans D do the following.
Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p
ic ow
| P], where p is the first element and P is the remaining list. Call insert_tree ([p | P]⋅T), which is performed as follows.
n
If T has a child N such that N_item⋅name = p⋅item⋅name, then increment N’s count by 1; else create a new node N, and
bl kn
let its count be 1, its parent link be linked to T, and its node-link to the nodes with the same item⋅name via the
node⋅link structure. If P is nonempty, call insert_tree (P, N) recursively.
at
Pu ch
2. The FP-tree is mined by calling FP growth. FP tree, null/, which is implemented as follows :
Analysis
− Two scans of the DB are necessary. The first collects the set of frequent items and the second constructs the FP-tree.
− The cost of inserting a transaction Trans into the FP-tree is O(|Trans|), where |Trans| is the number of frequent items
in Trans.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-36 Association Rules Mining
− Many transactions share items due to which the size of the FP-Tree can have a smaller size compared to
uncompressed data.
− Best case scenario : All transactions have the same set of items which results in a single path in the FP Tree.
− Worst case scenario : Every transaction has a distinct set of items, i.e. no common items
o FP-tree size is as large as the original data.
o FP-Tree storage is also higher, it needs to store the pointers between the nodes and the counter.
− FP-Tree size is dependent on the order of the items. Ordering of items by decreasing support will not always result in a
smaller FP -Tree size (it’s heuristic).
e
Ex. 4.9.1 : Transactions consist of a set of items I = {a, b, c, ...} , min support = 3
g
TID Items Bought
io eld
1 f, a, c, d, g, i, m, p
2 a, b, c, f, l, m, o
3 b, f, h, j, o
ic ow
4 b, c, k, s, p
5 a, f, c, e, l, p, m, n
n
bl kn
Soln. :
at
Item Sup.
Te
a 3
b 3
c 4
d 1
e 1
f 4
g 1
h 1
i 1
j 1
k 1
l 2
m 3
n 1
o 2
p 3
Item Sup.
a 3
b 3
c 4
f 4
m 3
p 3
Step 2 : Order all items in itemset in frequency descending order (min support = 3) (Note : Consider only items with min
support = 3)
e
2 a, b, c, f, l, m, o f, c, a, b, m
g
3 b, f, h, j, o f, b
io eld
4 b, c, k, s, p c, b, p
5 a, f, c, e, l, p, m, n f, c, a, m, p
ic ow
Originally Empty
(ii) Consider the first item in the second transaction i.e. f and add it in the tree.
After this step we get f:2, finished adding f in the above tree.
(iii) Now consider the second item in the above transaction .i.e. c.
g e
io eld
ic ow
n
bl kn
(v) Since we do not have a node b, we create one node for b below the node a (note : to maintain the path).
(vi) Now only m of second transaction is left. Though a node m is already exists still we can’t increase its count of the
existing node m as we need to represent the second transaction in FP tree, so add new node m below node b and link
it with existing node m.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-39 Association Rules Mining
Step 6 : Similarly insert the third transaction(f,b) as explained in step 5. So After the insertion of third transaction (f,b)
g e
io eld
ic ow
n
bl kn
Ex. 4.9.2 : A database has 6 transactions. Let minimum support = 60% and Minimum confidence = 70%
Transaction ID Items Bought
T1 {A, B, C, E}
T2 {A, C, D, E}
T3 {B, C, E}
T4 {A, C, D, E}
T5 {C, D, E}
T6 {A, D, E}
i) Find Closed frequent Itemsets
ii) Find Maximal frequent itemsets
iii) Design FP Tree using FP growth algorithm (SPPU - Dec. 18, 8 Marks)
Soln. :
e
− An itemset is closed if none of its immediate supersets has the same support as the itemset.
−
g
An itemset is maximal frequent if none of its immediate supersets is frequent.
io eld
Itemsets
ic ow
n
bl kn
at
Pu ch
Te
Method
− For each item, conditional pattern-base is constructed, and then it’s conditional FP-tree.
− On each newly created conditional FP-tree, repeat the process.
− The process is repeated until the resulting FP-tree is empty, or it has only a single path(All the combinations of sub
paths will be generated through that single path, each of which is a frequent pattern).
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-41 Association Rules Mining
Example : Finding all the patterns with ‘p’ in the FP tree given below :
g e
io eld
ic ow
n
− Following are the paths with ‘P’
bl kn
o We got (f:4, c:3, a:3, m:2, p:2) and (c:1, b:1, p:1)
o
Pu ch
o Therefore we have (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1)
Te
o Find all frequent patterns in the CPB and add ’p’ to them, this will give us all frequent patterns containing ‘p’.
o This can be done by constructing a new FP-Tree for the CPB.
− Finding all patterns with ‘P’.
o We again filter away all items < minimum support threshold ( i.e. 3)
o We generate (cp:3) (Note : we are finding frequent patterns containing item p, so we append p to c as c is only
item that has min support threshold.)
e
− Build FP tree using (f:2, c:2, a:2) and (f:1, c:1, a:1, b:1)
g
− Now we got ( f:3, c:3, a:3 ,b:1)
io eld
− Initial Filtering removes b:1 (We again filter away all items < minimum support threshold).
− Mining Frequent Patterns by Creating Conditional Pattern-Bases.
ic ow
f Empty Empty
T200 I2, I4
T300 I2, I3
T500 I1, I3
T600 I2, I3
T700 I1, I3
e
T900 I1, I2, I3
g
Min support = 2
io eld
Soln. :
ic ow
n
bl kn
at
Pu ch
Te
Ex. 4.9.4 : Consider the following dataset of frequent itemsets. All are sorted according to their support count. Construct
the FP-Tree and find Conditional Pattern base for D.
TID Items
1 {A, B}
2 {B, C, D}
3 {A, C, D, E}
4 {A, D, E}
5 {A, B, C}
6 {A, B, C, D}
7 {B, C}
8 {A, B, C}
e
9 {A, B, D}
g
io eld 10 {B, C, E}
Soln. :
ic ow
n
bl kn
at
Pu ch
Te
− Support count of D = 1.
e
− Again filter away all items < minimum support threshold
g
( i.e.1 as Support of D =1)
io eld
− Consider First Branch
{(A:1,B:1,C:1),(A:1,B:1),
ic ow
We generate ABCD:1
at
Pu ch
So append BC with D
We generate BCD : 1
− So Frequent Itemsets found (with sup > 1): AD, BD, CD, ACD, BCD which are generated from CPB on conditional
node D.
Completeness
− The method can mine short as well as long frequent patterns and it is highly efficient.
− FP-Growth algorithm is much faster than Apriori Algorithm.
e
Q. Explain the following terms : Multilevel association rules. (May 16, 3 Marks)
g
Q. Explain with example Multi level association Rule mining. (Dec. 18, 2.5 Marks)
io eld
− Items are always in the form of hierarchy.
− An item can be either generalized or specialized as per the described hierarchy of that item and its levels can be
n
powerfully preset in transactions.
bl kn
− Rules which combine associations with hierarchy of concepts are called Multilevel Association Rules.
at
Pu ch
Te
− The support and confidence of an item is affected due to its generalization or specialization value of attributes.
− The support of generalized item is more than the support of specialized item
− As only one minimum support is set, so there is no necessity to examine the items of itemset whose ancestors do
not have minimum support.
e
− If very high support is considered then many low level association may get missed.
−
g
If very low support is considered then many high level association rules are generated.
io eld
ic ow
n
bl kn
− As every level is having its own minimum support, the support at lower level reduces.
− The parent node is checked whether it’s frequent or not frequent and based on that node is examined.
e
− Some minimum support threshold is set for lower level.
g
−
io eld
So the items which do not satisfy minimum support are checked for minimum support threshold this is also called
“Level Passage Threshold”.
ic ow
Q. Explain the following terms : Constraints-based rule mining. (Dec. 16, 3 Marks)
Q. Explain with example Constraint based association Rule mining. (Dec. 18, 2.5 Marks)
at
Pu ch
Forms of constraints
Te
− Specifies the syntactic form of the rules in which we are interested. Syntactic forms serves as the constraint.
− Data mining system looks for the patterns which matches the given metarules. For example if two predicates Age and
Salary are given to analyse whether the customer buys “Apple Product”
age (C, “30..40”) Λ Salary (C, “30K..50K”) →buys (C, “Apple Product”)
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-49 Association Rules Mining
e
Multilevel associations
−
g
Items are always in the form of hierarchy.
io eld
− tems which are at leaf nodes are having lower support.
ic ow
n
bl kn
at
Pu ch
Te
− An item can be either generalized or specialized as per the described hierarchy of that item and its levels can be
powerfully preset in transactions.
− Rules which combine associations with hierarchy of concepts are called Multilevel Association Rules.
− The support and confidence of an item is affected due to its generalization or specialization value of attributes.
− The support of generalized item is more than the support of specialized item
− Similarly the support of rules increases from specialized to generalized itemsets.
− If the support is below the threshold value then that rule becomes invalid
− Confidence is not affected for general or specialized.
Multidimensional associations
− Single-dimensional rules : The rule contains only one distinct predicate. In the following example the rule has only one
predicate “buys”.
buys(X, “Butter”) ⇒ buys(X, “Milk”)
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 4-50 Association Rules Mining
− Categorical attributes : This have finite number of possible values and there is no ordering among values. Example :
brand, color.
− Quantitative attributes : These are numeric values and there is implicit ordering among values. Example : age,
income.
e
Pattern-pruning data-pruning
g
If we can prune a frequent pattern P after If we can prune Graph G from the data space search of P after
checking constraints on it, then the entire subtree data pruning checking G, will be pruned from the data search
io eld
rooted at P in the pattern tree model will not be grown. space of all nodes in the subtree rooted at P.
Pattern-pruning should be performed We would perform data pruning checking for
when Tc(P) <= p.Tp G if Td (P,G) < q.Tp
ic ow
n
Review Questions
bl kn
Unit V
Syllabus
Introduction to : Classification and Regression for Predictive Analysis, Decision Tree Induction, Rule-Based
Classification : using IF-THEN Rules for Classification, Rule Induction Using a Sequential Covering Algorithm.
Bayesian Belief Networks, Training Bayesian Belief Networks, Classification Using Frequent Patterns, Associative
e
Classification, Lazy Learners-k-Nearest-Neighbor Classifiers, Case-Based Reasoning.
g
io eld
5.1 Introduction to : Classification and Regression for Predictive Analysis
− Classification constructs the classification model based on training data set and using that model classifies the new
ic ow
data.
− It predicts the value of classifying attribute or class label.
n
bl kn
Typical applications
1. Regression
2. Decision trees
3. Rules
4. Neural networks
Q. Explain the training and testing phase using Decision Tree in detail. Support your answer with relevant example.
(Dec. 18, 8 Marks)
1. Model construction
− Those set of sample tuples or subset data set is known as training data set.
− The constructed model based on training data set is represented as classification rules, decision trees or
mathematical formulae.
2. Model usage
− For classifying unknown objects or new tuple use the constructed model.
− Compare the class label of test sample with the resultant class label.
− Estimate accuracy of the model by calculating the percentage of test set samples that are correctly classified by
the model constructed.
e
− Test sample data and training data samples are always different, otherwise over-fitting will occur.
g
Example
io eld
Classification process : (1) Model construction
ic ow
n
bl kn
at
Pu ch
Te
Fig. 5.1.3 : Classification : Test data are used to estimate the accuracy of the classification rule
For example
How to perform classification task for classification of medical patients by their disease ?
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-3 Classification
g e
io eld
ic ow
n
bl kn
at
Pu ch
Te
Fig. 5.1.4
Data preparation
− Data cleaning : Pre-process data in order to reduce noise and handle missing values.
− Relevance analysis (feature selection) : Remove the irrelevant or redundant attributes.
− Data transformation : Generalize the data to higher level concepts using concept hierarchies and/or normalize data
which involves scaling the values.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-4 Classification
e
1. Predictive accuracy
g
io eld
This refers the ability of the model to correctly predict the class label of new or previously unseen data.
3. Robustness
at
Pu ch
4. Interpretability
Te
5. Goodness of rules
5.1.4 Regression
− Suppose an employee needs to predict how much rise he will get in his salary after 5 years, means he bother to
predict the numeric value. In this case a model is constructed based on his previous salary values that predicts a
continuous-valued function or ordered value.
− Prediction is generally about the future values or the unknown events and it models continuous-valued functions.
− Most commonly used methods for prediction is regression.
− It determines the strength of relationship between one dependent variable with the other independent variable using
some statistical measure.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-5 Classification
Linear regression : Y = m + nX + u
Multiple regression : Y = m + n1X1 + n2X2 + n3X3 + ... + nt Xt + u
e
Where :
g
Y
io eld
= The dependent variable which we are
trying to predict
X = The independent variable that we are
ic ow
− Regression uses a group of random variables for prediction and finds a mathematical relationship between them. This
Te
relationship is depicted in the form of a straight line (linear regression) that approximates all the points in the best
way.
− Regression may be used to determine for e.g. price of a commodity, interest rates, the price movement of an asset
influenced by industries or sectors.
Linear Regression
Regression tries to find the mathematical relationship between variables, if it is a straight line then it is a linear model
and if it gives a curved line then it is a non linear model.
− The relationship between dependent and independent variable is described by straight line and it has only one
independent variable.
Y = α+ βX
− Two parameters, α and β specify the (Y-intercept and slope of the) line and are to be estimated by using the data
at hand.
− The value of Y increases or decreases in a linear manner as the value of X changes accordingly.
− Draw a line relating to Y and X which is well fitted to given data set.
− The idea situation is that if the line which is well fitted for all the data points and no error for prediction.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-6 Classification
− If there is random variation of data points, which are not fitted in a line then construct a probabilistic model
related to X and Y.
− Simple linear regression model assumes that data points deviates about the line, as shown in the Fig. 5.1.7.
g e
Fig. 5.1.7 : Linear regression
io eld
(B) Multiple Linear Regression
− Multiple linear regression is an extension of simple linear regression analysis .
ic ow
− It uses two or more independent variables to predict the outcome and a single continuous dependent variable.
n
Y = a0 + a1 X1 + a2 X2+ .... + ak X k + e
bl kn
− In log linear regression a best fit between the data and a log linear model is found.
− Major assumption : A linear relationship exists between the log of the dependent and independent variables.
− Loglinear models are models that postulate a linear relationship between the independent variables and the logarithm
of the dependent variable, for example :
log(y) = a0 + a1 x1 + a2 x2 ... + aN xN
where y is the dependent variable; xi, i=1,...,N are independent variables and {ai, i=0,...,N} are parameters
(coefficients) of the model.
− For example, log linear models are widely used to analyze categorical data represented as a contingency table. In this
case, the main reason to transform frequencies (counts) or probabilities to their log-values is that, provided the
independent variables are not correlated with each other, the relationship between the new transformed dependent
variable and the independent variables is a linear (additive) one.
− Training dataset should be class-labeled for learning of decision trees in decision tree induction.
− A decision tree represents rules and it is very a popular tool for classification and prediction.
− Rules are easy to understand and can be directly used in SQL to retrieve the records from database.
− To recognize and approve the discovered knowledge got from decision model is very crucial task.
− There are a many algorithms to build decision trees :
o ID3 (Iterative Dichotomiser 3)
o C4.5 (Successor of ID3)
o CART (Classification And Regression Tree)
o CHAID (CHi-squared Automatic Interaction Detector
e
Decision tree learning is appropriate for the problems having the characteristics given below :
g
− Instances are represented by a fixed set of attributes (e.g. gender) and their values (e.g. male, female) described as
io eld
attribute-value pairs.
− If the attribute has small number of disjoint possible values (e.g. high, medium, low) or there are only two possible
classes (e.g. true, false) then decision tree learning is easy.
ic ow
− Extension to decision tree algorithm also handles real value attributes (e.g. salary).
n
− Decision tree gives a class label to each instance of dataset.
bl kn
− Decision tree methods can be used even when some training examples have unknown values (e.g. humidity is known
for only a fraction of the examples).
at
Pu ch
− Learned functions are either represented by a decision tree or re-represented as sets of if-then rules to improve
readability.
Te
Decision tree classifier has tree type structure which has leaf nodes and decision nodes.
− A leaf node is the last node of each branch and indicates class label or value of target attribute.
− A decision node is the node of tree which has leaf node or sub-tree. Some test to be carried on the each value of
decision node to get the decision of class label or to get next sub-tree.
e
Q. Write a pseudo code for the construction of Decision Tree State and justify its time complexity also.
(Dec. 15, 4 Marks)
g
The Basic ideas behind ID3 :
io eld
− C4.5 is an extension of ID3.
− C4.5 accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation and
ic ow
so on.
n
− C4.5 is a designed by Quinlan to address the following issues not given by ID3 :
bl kn
o It gives better the efficiency of computation. The algorithm to generate decision tree is given by Jiawei Han et al.
as below :
Algorithm : Generate_decision_tree : Generate a decision tree from the training tuples of data partition, D.
Input
− Data partition, D, which is a set of training tuples and their associated class labels;
Output
A decision tree.
Method
1. create a node N;
2. if tuples in D are all of the same class, C, then
e
13. attach a leaf labeled with the majority class in D to node N;
g
14. else attach the node returned by Generate_decision_tree (Dj, attribute_list) to node N; endfor
io eld
15. return N;
− Time complexity : For a normal style decision tree such as C4.5 the time complexity is O(N D^2), where D is the
number of features. A single level decision tree would be O(N D)
ic ow
− Because of noise or outliers, the generated tree may overfit due to many branches.
at
Pu ch
Prepruning
Te
Postpruning
− Build the full tree then start pruning, remove the branches.
− Use different set of data than training data set to get the best pruned tree.
Ex. 5.2.1 : Apply ID3 on the following training dataset from all electronics customer database and extract the classification
rule from the tree.
e
>40 Medium No Excellent No
g
Soln. :
io eld
Class P : buys_computer = “yes”
p p n n
= – p + n log2 p + n – p + n log2 p + n
Te
So, the expected information needed to classify a given sample if the samples are partitioned according to age is,
Calculate entropy using the values from the above table and the formula given below :
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
i=1
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-11 Classification
5 4 5
E (age) = 14 I(2, 3) + 14 I(4, 0) + 14 I(3, 2) = 0.694
Since Age has three possible values, the root node has three branches (<=30, 31…40, >40).
g e
Step 2 :
io eld
The next question is “what attribute should be tested at the Age branch node?” Since we have used Age at the root,
now we have to decide on the remaining three attributes:income, student, or credit_rating.
ic ow
Consider Age : <= 30 and count the number of tuples from the original given training set
n
S<=30 = 5 ( Age: <=30 )
bl kn
Calculate entropy using the values from the above table and the formula given as :
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-12 Classification
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
i=1
E (Income) = 2/5 * I (0, 2) + 2/5* I(1, 1) + 1/5 *I(1, 0) = 0.4
e
Therefore, I(pi , ni) = I(0,3) = – (0/3) log2(0/3) – (3/3) log2(3/3)= 0.
g
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
io eld
Student pi ni I(pi, ni)
No 0 3 0
ic ow
Yes 2 0 0
Calculate Entropy using the values from the above table and the formula given below
n
bl kn
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
at
i=1
Pu ch
Therefore
I(pi , ni ) = I(1, 2) = – (1/3) log2 (1/3) – (2/3) log2 (2/3)
= 0.918
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
Credit_rating pi ni I(pi, ni)
Fair 1 2 0.918
Excellent 1 1 1
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-13 Classification
Calculate entropy using the values from the above table and the formula given below :
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
i=1
Hence
e
Gain(S<=30, student) = 0.970
Gain(S<=30, income) = 0.570
Gain(S<=30, credit_rating) = 0.02
g
io eld
Student has the highest gain; therefore, it is below Age : “<=30”.
ic ow
n
bl kn
at
Pu ch
Fig. P. 5.2.1(a)
Te
Step 3 :
Consider now only income and credit rating for age : 31…40 and count the number of tuples from the original given
training set
S31…40 = 4 ( age : 31…40 )
Age Income Student Credit_rating Buys_computer
31…40 High No Fair Yes
31…40 Low Yes Excellent Yes
31…40 Medium No Excellent Yes
31…40 High Yes Fair Yes
Since for the attributes income and credit_rating, buys_computer = yes, so assign class ‘yes’ to 31…40
Fig. P. 5.2.1(b)
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-14 Classification
Step 4 :
Consider income and credit_rating for age: >40 and count the number of tuples from the original given training set
S>40 = ( age : > 40 )
Age Income Student Credit_rating Buys_computer
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
>40 Medium Yes Fair Yes
>40 Medium No Excellent No
Consider the above table as the new training set and calculate the Gain for income and credit_rating
Class P : buys_computer = “yes”
e
Class N: buys_computer = “no”
g
Total number of records 5.
io eld
Count the number of records with “yes” class and “no” class.
So number of records with “yes” class = 3 and “no” class = 2
ic ow
= 0.970
(iv) Compute the entropy for credit_rating
Te
Calculate entropy using the values from the above table and the formula given below
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
i=1
3 2
E(Credit_rating) = 5 l (3,0) + 5 l (0,2 ) = 0
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-15 Classification
Hence,
= 0.970 – 0 = 0.970
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
e
High 0 0 0
g
io eld Medium 2 1 0.918
Low 1 1 1
ic ow
Calculate Entropy using the values from the above table and the formula given below
n
E (Income) = 0/5 * I (0, 0) + 3/5* I(2, 1) + 2/5 *I(1, 1) = 0.951
bl kn
Therefore,
Gain(S>40,income) = 0.019
Gain(S>40,Credit_rating) = 0.970
Fig. P. 5.2.1(c)
Example
e
IF age = “<=30” AND student = “no”
g
THEN buys_computer = “no” io eld
IF age = “<=30” AND student = “yes”
THEN buys_computer = “yes”
ic ow
Ex. 5.2.2 : The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following
Te
values :
Outlook = {sunny, overcast, rain}
temperature = {hot, mild, cool}
humidity = {high, normal}
wind = {weak, strong}
Sample data set S are :
Table P. 5.2.2 : Training data set for Play Tennis
We need to find which attribute will be the root node in our decision tree. The gain is calculated for all four
attributes using formula of gain (A).
Soln. :
Class P : Playball = “yes”
Class N : Playball = “no”
e
Total number of records 14.
g
Count the number of records with “yes” class and “no” class.
io eld
So number of records with “yes” class = 9 and “no” class = 5
p p n n
= – p + n log2 p + n – p + n log2p + n
n
bl kn
I (p, n) = I (9, 5)
= – (9/14) log2 (9/14) – (5/14) log2 (5/14)
at
Pu ch
Therefore ,
I(pi , ni) = I(2,3)
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
Calculate entropy using the values from the above table and the formula given below :
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
i=1
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-18 Classification
5 4 5
E (outlook) = 14 I (2, 3) + 14 I(4, 0) + 14 I(3, 2) = 0.694
e
As Outlook has only values “sunny, overcast, rain”, the root node has three branches
g
io eld
ic ow
Fig. P. 5.2.2(a)
n
Step 2 :
bl kn
As attribute outlook at root, we have to decide on the remaining three attribute for sunny branch node.
at
Consider outlook = Sunny and count the number of tuples from the original given training set
Pu ch
Therefore ,
I(pi , ni) = I(0,2) = – (0/2)log2(0/2) – (2/2)log2(2/2) = 0
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-19 Classification
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
Temperature pi ni I(pi, ni)
Hot 0 2 0
Mild 1 1 1
Cool 1 0 0
Calculate Entropy using the values from the above table and the formula given below :
v
pi + ni
E (A) = ∑ p + n I (pi, ni)
i=1
= 0.4
e
Note : Tsunny is the total training set.
Hence
g
io eld
Gain(Tsunny, temperature) = I (p, n) – E (temperature)
ic ow
Therefore ,
I(pi , ni ) = I(0,3)
Te
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
Calculate Entropy using the values from the above table and the formula given below
v
p i + ni
E (A) = ∑ p + n I(pi, ni)
i=1
E (Humidity) = 3/5 * I (0, 3) + 2/5* I(2, 0) = 0
Hence
= 0.971 – 0 = 0.971
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-20 Classification
Therefore,
Similarly for different outlook ranges I(pi , ni) is calculated as given follows :
e
Calculate Entropy using the values from the above table and the formula given as:
g
v
io eld p i + ni
E (A) = ∑ p + n I(p1, n1)
i=1
ic ow
Therefore,
Te
Fig. P. 5.2.2(b)
Step 3 :
Consider now only temperature and wind for outlook = Overcast and count the number of tuples from the original
given training set
Tovercast = {3,7,12,13}
Since for the attributes temperature and wind, playball = yes, so assign class ‘yes’ to overcast.
g e
io eld
Fig. P. 5.2.2(c)
Step 4 :
ic ow
Consider temperature and wind for outlook = Rain and count the number of tuples from the original given training set
n
Train = {4, 5, 6, 10, 14}
bl kn
Consider the above table as the new training set and calculate the Gain for temperature and Wind.
Calculate entropy using the values from the above table and the formula given below :
v
pi + n i
E (A) = ∑ p + n I (pi, ni)
i=1
3 2
e
E(Wind) = 5 I (3, 0) + 5 I (0, 2) = 0
g
HenceGain(Train, Wind) = I (p, n) – E (Wind)
io eld
= 0.970 – 0 = 0.970
(v) Compute the entropy for Temperature : (Hot, mild , cool)
ic ow
Similarly for different outlook ranges I(pi , ni) is calculated as given below :
Temperature pi ni I(pi, ni)
Te
Hot 0 0 0
Mild 2 1 0.918
Cool 1 1 1
Calculate Entropy using the values from the above table and the formula given below :
v
pi + n i
E (A) = ∑ p + n I (pi, ni)
i=1
= 0.951
Hence
Gain(TRain, temperature) = I (p, n) – E (temperature)
= 0.970 – 0.951 = 0.019
Therefore,
Gain(Train, Temperature) = 0.019
Gain(Train, Wind) = 0.970
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-23 Classification
Fig. P. 5.2.2(d)
g e
io eld
ic ow
n
Fig. P. 5.2.2(e) : Decision tree for play tennis
bl kn
Ex. 5.2.3 : A sample training dataset for stock market is given below. Profit is the class attribute and value is based on
age, contest and type.
Age Contest Type Profit
Old Yes Swr Down
Old No Swr Down
Old No Hwr Down
Mid Yes Swr Down
Mid Yes Hwr Down
Mid No Hwr Up
Mid No Swr Up
New Yes Swr Up
New No Hwr Up
New No Swr Up
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-24 Classification
e
Fig. P. 5.2.3
g
Ex. 5.2.4 :
io eld
Using the following training data set. Create classification model using decision-tree and hence classify
following tuple.
Tid Income Age Own House
ic ow
I (p, n) = I (7, 5)
= – (7/12) log2 (7/12) – (5/12) log2 (5/12)
= 0.979
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-25 Classification
Step 1 : Compute the entropy for Income : (Very high, high, medium, low)
e
Calculate entropy using the values from the above table and the formula given below :
g
v
pi + ni
E(A) = Σ p + n I (pi,ni)
io eld
i=1
Calculate entropy using the values from the above table and the formula given below :
v
p1 + n1
E (A) = ∑ p + n I (p1, n1)
i=1
= 0.904
Since income has four possible values, the root node has four branches (very high, high, medium, low).
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-26 Classification
Fig. P. 5.2.4(a)
Step 3 :
Since we have used income at the root, now we have to decide on the age attribute.
Consider income = “very high” and count the number of tuples from the original given training set
Svery high = 2
Since both the tuples have class label = “yes” , so directly give “yes” as a class label below “very high”.
Similarly check the tuples for income = “high” and income = “low” , are having the class label “yes” and “rented”
respectively.
g e
Now check for income = “medium”, where number of tuples having “yes” class label is 1 and tuples having “rented”
io eld
class label are 2.
So put the age label below income=“medium”.
So the final decision tree is :
ic ow
n
bl kn
at
Pu ch
Te
Fig. P. 5.2.4(b)
Ex. 5.2.5 : Data Set: A set of classified objects is given as below. Apply ID3 to generate tree.
Attribute
Sr. No. Colour Outline Dot Shape
1 Green Dashed No Triangle
2 Green Dashed Yes Triangle
3 Yellow Dashed No Square
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-27 Classification
Attribute
Sr. No. Colour Outline Dot Shape
4 Red Dashed No Square
5 Red Solid No Square
6 Red Solid Yes Triangle
7 Green Solid No Square
8 Green Dashed No Triangle
9 Yellow Solid Yes Square
10 Red Solid No Square
11 Green Solid Yes Square
12 Yellow Dashed Yes Square
13 Yellow Soild No Square
e
14 Red Dashed yes Triangle
g
Soln. :
io eld
Class N : Shape = “Triangle”
P(square) = 9/14
P(triangle) = 5/14
Te
I (p, n) = I (9,5)
= – (9/14) log2 (9/14) – (5/14) log2 (5/14)
= 0.940
Fig. P. 5.2.5(a)
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-28 Classification
Calculate Entropy using the values from the above table and the formula given below :
v
pi + ni
E(A) = Σ
p + n I (pi,ni)
e
i=1
g
E (Color) = 5/14 * I(3,2) + 5/14 * I(2,3) + 4/14 * I(4,0)
io eld
= 0.694
Solid 6 1 0.621
Calculate Entropy using the values from the above table and the formula given below :
v
p1 + n1
E (A) = ∑ p + n I (p1, n1)
i=1
Hence
Similarly for different dot values, I(pi , ni) is calculated as given below :
Outline pi ni I(pi, ni)
No 6 2 0.811
Yes 3 3 1
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-29 Classification
Calculate entropy using the values from the above table and the formula given below :
v
p1 + n1
E (A) = ∑ p + n I (p1, n1)
i=1
= 0.892
e
Gain(S, outline) = 0.137
g
io eld
Gain(S, dot ) = 0.048
Fig. P. 5.2.5(b)
at
Pu ch
Step 4 : As attribute color is at the root, we have to decide on the remaining two attribute for red branch node.
Consider color =red and count the number of tuples from the original given training set
Te
Attribute Shape
Color Outline Dot
1. Red Dashed No Square
2. Red Solid No Square
3. Red Solid Yes Triangle
4. Red Solid No Square
5. Red Dashed Yes Triangle
Similarly for different outline values, I(pi , ni) is calculated as given below.
Outline pi ni I(pi, ni)
Dashed 1 1 1
Solid 2 1 0.918
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-30 Classification
Calculate Entropy using the values from the above table and the formula given as :
v
p1 + n1
E (A) = ∑ p + n I (p1, n1)
i=1
Hence
Similarly for different Dot values, I(pi , ni) is calculated as given below :
e
Outline pi ni I(pi, ni)
g
No 3 0 0
io eld
Yes 0 2 0
Calculate entropy using the values from the above table and the formula given below
ic ow
v
p1 + n1
n
E (A) = ∑ p + n I (p1, n1)
bl kn
i=1
= 0.971– 0 = 0.971
Check the tuples with Dot = “yes” from sample Sred , it has class triangle.
Check the tuples with Dot = “no” from sample Sred , it has class square
So the partial tree for red color sample is as given in Fig. P. 5.2.5(c).
Fig. P. 5.2.5(c)
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-31 Classification
Step 5 : Consider Color = Yellow and count the number of tuples from the original given training set.
Attribute Shape
Color Outline Dot
1. Yellow Dashed No Square
2. Yellow Solid Yes Square
3. Yellow Dashed Yes Square
4. Yellow Solid No Square
As all the tuples belong to yellow color have class label square, so directly assign a class label below the node
color = “yellow” as square.
g e
io eld
ic ow
n
Fig. P. 5.2.5(d)
bl kn
Step 6 : Consider Color = green and count the number of tuples from the original given training set, as only attribute
outline has left, it becomes a node below color =“green”.
at
Pu ch
Attribute Shape
Color Outline Dot
Te
Fig. P. 5.2.5(e)
Check the tuples with Outline = “dashed” from sample Sgreen , it has class triangle.
Check the tuples with outline = “solid” from sample Sgreen , it has class square.
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-32 Classification
Fig. P. 5.2.5(f)
Ex. 5.2.6 : Apply statistical based algorithm to obtain the actual probabilities of each event to classify the new tuple as a
e
tall. Use the following data
g
io eld Person ID Name Gender Height Class
1 Kristina Female 1.6m Short
2 Jim Male 2m Tall
3 Maggie Female 1.9m Medium
ic ow
Soln. :
P(t|Short) = 1/4 * 0 = 0
P(t|Medium) = 1/3 * 0 = 0
P(t|Tall) = 2/2 * 1 /2 = 0.5
Therefore likelihood of being short = p(t|short)* P(short) = 0 * 4/9 = 0
e
Finally Actually probabilities of each event
g
P(Short | t) = (P(t|short) * p(short) )/ P(t)
io eld
P(Short | t) = (0 * 4 /9)/0.11 = 0
Ex. 5.2.7 : The training data is supposed to be a part of a transportation study regarding mode choice to select Bus, Car or
at
Pu ch
Gender Car ownership Travel cost ($)/km Income level Transportation mode
Male 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 1 Cheap Medium Train
Female 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Male 0 Standard Medium Train
Female 1 Standard Medium Train
Female 1 Expensive High Car
Male 2 Expensive Medium Car
Female 2 Expensive High Car
Suppose we have new unseen records of a person from the same location where the data sample was taken. The
following data are called test data (in contrast to training data) because we would like to examine the classes of these
data.
Person name Gender Car ownership Travel cost ($)/km Income level Transportation mode
Alex Male 1 Standard High ?
Buddy Male 0 Cheap Medium ?
Cherry Female 1 Cheap High ?
The question is what transportation mode would Alex, Buddy and Cheery use?
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-34 Classification
Soln. :
Class P : Transportation mode= “Bus”
Class Q : Transportation mode= “Train”
Class N : Transportation mode= “Car”
Total no. of records: 10
No. of records with “Bus” class = 4
No. if records with “Train” class = 3
No. of records with “Car” class = 3
So, Information Gain = I(p,q,n)
= – (p/(p + q + n)) log2 (p/(p + q + n)) – (q/(p + q + n)) log2 (q/(p + q + n)) – (n/(p + q + n)) log2 (n/(p + q + n))
e
I(4,3,3) = 0.5288 + 0.5211 + 0.5211
g
I(4,3,3) = 1.571
io eld
Step 1 : Compute the entropy of gender : (Male, Female)
For gender = Male pi = 3 qi = 1 ni = 1
ic ow
Therefore ,
I(pi , qi, ni) = I(3,1,1)
n
bl kn
= 1.371
at
Pu ch
Similarly for different gender I(pi , qi, ni) is calculated as given below :
Male 3 1 1 1.371
Female 1 2 2 1.522
Calculate Entropy using the values from the above table and the formula given below
v
pi + ni
E (A) = Σ p + n I (pi, ni)
i=1
Hence,
Gain(S, gender) = I (p, q, n) – E (gender)
= 1.571 – 1.447=0.124
Similarly,
Travel cost ($)/Km attribute has the highest gain, therefore it is used as the decision attribute in the root node.
Since travel cost ($)/Km has three possible values, the root node has three branches (Cheap, Standard, Expensive).
Since for all the attributes of Travel Cost ($)/Km = expensive, Transportation mode = “Car” , so assign class ‘Car’ to
expensive.
Since for all the attributes of Travel Cost ($)/Km = Standard, Transportation mode = “Train”, so assign class ‘Train’ to
standard.
g e
Fig. P. 5.2.7(a)
io eld
Consider travel cost ($)/Km = Cheap and count the number of tuples from the original given training set
Scheap = 5
ic ow
Attributes Classes
Gender Car ownership Travel cost Income level Transportation
n
($)/km mode
bl kn
= 0.722
= 0
Similarly for different genders I(pi , qi) is calculated as given below :
Gender pi qi ni I(pi, qi ,ni)
Male 3 0 0 0
Female 1 1 0 1
Calculate entropy using the values from the above table and the formula given below
Data Mining & Warehousing (SPPU-Sem. 7-Comp) 5-36 Classification
v
pi + ni
E (A) = Σ p + n I (pi, ni)
i=1
E (gender) = 3/5 * I (3, 0,0) + 2/5* I(1, 1,0)
= 0.4
Note : Scheap is the total training set.
Hence,
Gain(Scheap,gender) = I (p,q, n) – E (gender)
= 0.722 – 0.4 = 0.322
(ii) Compute the entropy for Car ownership: (0, 1, 2)
For Car ownership = 0,
e
pi = with “bus” class = 2 ,qi = with “train” class = 0 and ni with “car” class = 0
g
Therefore, I(pi , qi, ni) = I(2,0,0)
io eld
= – (2/2) log2 (2/2) – (0/2) log2 (0/2) – (0/2) log2 (0/2) = 0.
Similarly for different outlook ranges I(pi , qi ,ni) is calculated as given below :
ic ow
1 2 1 0 0.918
2 0 0 0 0
at
Pu ch
Calculate Entropy using the values from the above table and the formula given below
v
Te
pi + ni
E (A) = Σ p + n I (pi, ni)
i=1
E (Car ownership) = 2/5 * I (2, 0,0) + 3/5* I(2, 1,0) + 0/5* I(0,0,0)
= 0.551
Note : Scheap is the total training set.
Therefore,
I(pi, qi, ni) = I(2,0,0)
Similarly for different outlook ranges I(pi , qi,ni) is calculated as given below :
Income level pi qi ni I(pi , qi, ni)
Low 2 0 0 0
Medium 2 1 0 0.918
High 0 0 0 0
Calculate Entropy using the values from the above table and the formula given below
v
pi + ni
E (A) = Σ p + n I (pi, ni)
i=1
E (Income Level) = 2/5 * I (2, 0, 0) + 3/5* I(2, 1,0) + 0/5* I(0,0,0) = 0.551
Note : Scheap is the total training set.
e
Hence, Gain(Scheap,Income level) = I (p,q, n) – E (Income level)
g
= 0.722 – 0.551 = 0.171
Therefore, since gender has the highest gain, it comes below cheap.
io eld
For all gender = Male, Transportation mode= bus
Fig. P. 5.2.7(b)
Sfemale = 2
Gender Car ownership Income level Transportation mode
Female 1 Medium Train
Female 0 Low Bus
Suppose we select the attribute Car ownership; we can then update our decision tree into its final version.
Fig. P. 5.2.7(c)
Ex. 5.2.8 : The table below shows a sample dataset of whether a customer responds to a survey or not. “Outcome” is the
class label. Construct a Decision Tree Classifier for the dataset. For a new example (Rural, semidetached,
low, No), what will be the predicted class label?
District House type Income Previous customer Outcome
Suburban Detached High No Nothing
Suburban Detached High Yes Nothing
Rural Detached High No Responded
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Yes Nothing
Rural Semi-detached Low Yes Responded
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Urban Terrace Low No Responded
Suburban Terrace Low Yes Responded
Rural Terrace High Yes Responded
Count the number of records with “Responded” class and “Nothing” class.
So number of records with “Responded” class = 9 and “Nothing” class = 5
I (p, n) = I (9, 5)
= – (9/14) log2 (9/14) – (5/14) log2 (5/14) = 0.940
Step 1 : Compute the entropy for District : (Suburban, Rural, Urban)
For District = Suburban, pi = with “Responded” class = 2 and ni = with “Nothing” class = 3
Therefore, I(pi, ni) = I(2, 3)
= – (2/5) log2 (2/5) – (3/5) log2(3/5) = 0.971.
Similarly, I(pi, ni) for each District value is calculated as given below :
District pi ni I(pi, ni)
Suburban 2 3 0.971
Rural 4 0 0
Urban 3 2 0.971
Calculate entropy using the values from the above table and the formula given below :
E(A) = Σ (from i = 1 to v) [(pi + ni) / (p + n)] × I(pi, ni)
E(District) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
Hence, Gain(S, District) = I(p, n) – E(District) = 0.940 – 0.694 = 0.246. District gives the highest gain, so it is selected as the root node.
As District has the three values Suburban, Rural and Urban, the root node has three branches.
Step 2 :
As attribute District is at the root, we have to decide among the remaining three attributes for the Suburban branch.
Consider District = Suburban and count the number of tuples from the original given training set : Ssuburban = 5
(i) Compute the entropy for House_Type : (Detached, Semi-detached, Terrace)
For House_Type = Detached, pi = with “Responded” class = 0 and ni = with “Nothing” class = 2
Therefore, I(pi, ni) = I(0, 2)
= – (0/2) log2 (0/2) – (2/2) log2 (2/2) = 0
Similarly, I(pi, ni) for each House_Type value is calculated as given below :
House_Type pi ni I(pi, ni)
Detached 0 2 0
Terrace 1 1 1
Semi-detached 1 0 0
Calculate Entropy using the values from the above table and the formula given below :
E(A) = Σ (from i = 1 to v) [(pi + ni) / (p + n)] × I(pi, ni)
E(House_Type) = (2/5) I(0, 2) + (2/5) I(1, 1) + (1/5) I(1, 0) = 0.4
Hence, Gain(Ssuburban, House_Type) = 0.971 – 0.4 = 0.571
(ii) Compute the entropy for Income : (High, Low)
For Income = High, pi = with “Responded” class = 0 and ni = with “Nothing” class = 3
Therefore, I(pi, ni) = I(0, 3) = – (0/3) log2 (0/3) – (3/3) log2 (3/3) = 0
Similarly, I(pi, ni) for each Income value is calculated as given below :
Income pi ni I(pi, ni)
High 0 3 0
Low 2 0 0
Calculate Entropy using the values from the above table and the formula given as follows :
E(A) = Σ (from i = 1 to v) [(pi + ni) / (p + n)] × I(pi, ni)
E(Income) = (3/5) I(0, 3) + (2/5) I(2, 0) = 0
Hence, Gain(Ssuburban, Income) = I(p, n) – E(Income) = 0.971 – 0 = 0.971
(iii) Compute the entropy for Previous_Customer : (No, Yes)
For Previous_Customer = No,
pi = with “Responded” class = 1 and ni = with “Nothing” class = 2
Therefore, I(pi, ni) = I(1, 2) = – (1/3) log2 (1/3) – (2/3) log2 (2/3) = 0.918
Similarly, I(pi, ni) for each Previous_Customer value is calculated as given below :
Previous_Customer pi ni I(pi, ni)
No 1 2 0.918
Yes 1 1 1
Calculate entropy using the values from the above table and the formula given below :
E(A) = Σ (from i = 1 to v) [(pi + ni) / (p + n)] × I(pi, ni)
E(Previous_Customer) = (3/5) I(1, 2) + (2/5) I(1, 1) = 0.951
Hence, Gain(Ssuburban, Previous_Customer) = 0.971 – 0.951 = 0.020
Since Income has the highest gain for the Suburban branch, it is chosen as the next decision attribute there.
Step 3 :
Since for all tuples with District = Rural, Outcome = “Responded” irrespective of House_Type, Income and Previous_Customer, assign class ‘Responded’ to the Rural branch.
Step 4 :
Consider House_Type and Previous_Customer for District = Urban and count the number of tuples from the
original given training set
District House type Income Previous customer Outcome
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Yes Nothing
Urban Terrace Low No Responded
Urban Terrace High Yes Nothing
Consider the above table as the new training set and calculate the Gain for House_Type and Previous_Customer.
io eld
Class P : Outcome = “Responded”
Class N : Outcome = “Nothing”
I(p, n) = – (p/(p + n)) log2 (p/(p + n)) – (n/(p + n)) log2 (n/(p + n))
I(p, n) = I(3, 2) = – (3/5) log2 (3/5) – (2/5) log2 (2/5) = 0.970
(iv) Compute the entropy for Previous_Customer
For Previous_Customer = No
pi = with “Responded” class = 3 and ni = with “Nothing” class = 0
Similarly, I(pi, ni) for each Previous_Customer value is calculated as given below :
Previous_Customer pi ni I(pi, ni)
No 3 0 0
Yes 0 2 0
Calculate entropy using the values from the above table and the formula given as follows :
E(A) = Σ (from i = 1 to v) [(pi + ni) / (p + n)] × I(pi, ni)
E(Previous_Customer) = (3/5) I(3, 0) + (2/5) I(0, 2) = 0
Similarly, compute the entropy for House_Type. I(pi, ni) for each House_Type value is calculated as given below :
House_Type pi ni I(pi, ni)
Detached 0 0 0
Terrace 2 1 0.918
Semi-detached 1 1 1
Calculate entropy using the values from the above table and the formula given below :
E(A) = Σ (from i = 1 to v) [(pi + ni) / (p + n)] × I(pi, ni)
E(House_Type) = (3/5) I(2, 1) + (2/5) I(1, 1) = 0.951
Hence, Gain(SUrban, House_Type) = 0.970 – 0.951 = 0.019, whereas Gain(SUrban, Previous_Customer) = 0.970 – 0 = 0.970.
Since Previous_Customer has the highest gain for the Urban branch, it is chosen as the decision attribute there : Previous_Customer = No gives class ‘Responded’ and Previous_Customer = Yes gives class ‘Nothing’.
For the new example (Rural, Semi-detached, Low, No), District = Rural, so the constructed decision tree predicts the class label ‘Responded’.
5.3 Rule-Based Classification
A set of IF-THEN rules is used for classification in rule-based classification. A rule-based classifier classifies a record based on a collection of IF-THEN rules. The syntax of a rule is “IF condition THEN conclusion”.
Example
If a rule is X → Y, then X is the condition (a conjunction of attribute tests) and Y is the class label of the rule.
The LHS of the rule is the rule antecedent and the RHS is the consequent.
1. Coverage of a rule : The percentage of tuples in the dataset that satisfy the antecedent of the rule.
2. Accuracy of a rule : Of the tuples that satisfy the antecedent, the percentage that also satisfy the consequent, i.e. the percentage of covered tuples that the rule classifies correctly.
Formulae : Coverage(R) = n_covers / |D| and Accuracy(R) = n_correct / n_covers, where n_covers is the number of tuples covered by R, n_correct is the number of covered tuples correctly classified by R, and |D| is the total number of tuples.
Example
(Status = Single) → No
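Coverage and accuracy of a rule can be computed directly from data. A minimal sketch, assuming a tiny illustrative dataset and the rule (Status = Single) → No; neither the data nor the helper names come from the textbook :

# each record: (refund, marital_status, class_label) -- toy data for illustration
data = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Single", "Yes"),
]

def antecedent(record):
    # antecedent of the rule (Status = Single); its consequent is class "No"
    return record[1] == "Single"

covered = [r for r in data if antecedent(r)]
correct = [r for r in covered if r[2] == "No"]

coverage = len(covered) / len(data)      # share of all tuples that satisfy the antecedent
accuracy = len(correct) / len(covered)   # share of covered tuples the rule classifies correctly
print(coverage, accuracy)                # 0.5 and ~0.67 for this toy data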
1. Mutually exclusive rules : Every record of dataset is covered by at most one rule of classifier and the rules are
independent of each other
2. Exhaustive rules : Classifier generates rules for every possible combination of attribute values and each record is
covered by at least one rule.
Example
Fig. 5.3.1
Classification rules
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
− Size ordering : when more than one rule is triggered, give the highest priority to the triggering rule that has the maximum number of attribute tests.
− Rule ordering : arrange the rules in a decision list, ordered by some measure of rule quality or by expert opinion.
− Once a decision tree has been created, it can be converted into rules; the rules are easier to understand than a big and complex tree.
− For every path of the tree, create a rule from root node to a leaf node.
Example : rules extracted from a buys_computer decision tree
1. IF age = “<=30” AND student = “no” THEN buys_computer = “no”
2. IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
3. IF age = “31…40” THEN buys_computer = “yes”
− Using a sequential covering algorithm, IF-THEN rules can be extracted directly from the training data, without first generating a decision tree.
− The rules can be learned one at a time.
− Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules
− Some of the sequential covering algorithms are :
o AQ
o CN2
o RIPPER (more recent)
Algorithm : Sequential covering
Input : A training data set D of class-labelled tuples and the set of attributes.
Output : A set of IF-THEN classification rules.
Method : Rules are learned for one class at a time; each time a rule is learned, the tuples it covers are removed from the training data.
The basic functionality of this algorithm is :
1. Start from an empty rule.
2. Grow a rule using the Learn-One-Rule function.
3. Remove the training records covered by the rule.
4. Repeat steps 2 and 3 until a stopping criterion is met.
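The loop structure of sequential covering can be sketched as follows. Learn-One-Rule is passed in as a function and is not implemented here (a real version would greedily add and prune conjuncts), so treat this as an illustrative outline rather than the exact algorithm of any particular system :

def sequential_covering(records, classes, learn_one_rule, min_covered=1):
    # learn a disjunctive set of IF-THEN rules, one class at a time
    rule_set = []
    for cls in classes:
        remaining = list(records)
        while remaining:
            rule = learn_one_rule(remaining, cls)     # grow one rule for this class
            if rule is None:
                break                                 # no acceptable rule could be found
            covered = [r for r in remaining if rule(r)]
            if len(covered) < min_covered:
                break
            rule_set.append((cls, rule))
            # remove the tuples covered by the new rule and repeat
            remaining = [r for r in remaining if not rule(r)]
    return rule_set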
5.5 Bayesian Belief Networks
− The network provides a graphical model of causal relationships on which learning can be performed.
Fig. 5.5.1
− A portion of a belief network, consisting of a node X, having variable values (x1, x2, ...), its parents (A and B), and its
children (C and D)
Example 1
− You have a new burglar alarm installed at home.
− It is fairly reliable at detecting burglary, but also sometimes responds to minor earthquakes.
− You have two neighbors, Ali and Veli, who promised to call you at work when they hear the alarm.
− Ali always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then too.
The Bayesian network for the burglar alarm example : Burglary (B) and Earthquake (E) directly affect the probability of the alarm (A) going off, but whether or not Ali calls (AC) or Veli calls (VC) depends only on the alarm.
A P(VC = T) P(VC = F)
T 0.70 0.30
F 0.01 0.99
A P(AC = T) P(AC = F)
T 0.90 0.10
F 0.05 0.95
− What is the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both
Ali and Veli Call ?
P(AC, VC, A, ¬B, ¬E) = P(AC | A) × P(VC | A) × P(A | ¬B, ¬E) × P(¬B) × P(¬E) = 0.00062
(Capital letters represent variables having the value true and ¬ represents negation.)
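The joint probability factorizes over the network as shown above. The CPTs for Burglary, Earthquake and the Alarm are not reproduced in the tables above, so the sketch below assumes the commonly used values P(B) = 0.001, P(E) = 0.002 and P(A = T | ¬B, ¬E) = 0.001; together with the Ali/Veli tables given above this reproduces the quoted 0.00062 :

# assumed CPT values, not shown in the extracted tables above
p_b = 0.001                    # P(Burglary = T)
p_e = 0.002                    # P(Earthquake = T)
p_a_not_b_not_e = 0.001        # P(Alarm = T | not B, not E)

# from the tables above
p_ac_given_a = 0.90            # P(AC = T | A = T)
p_vc_given_a = 0.70            # P(VC = T | A = T)

# P(AC, VC, A, not B, not E) = P(not B) P(not E) P(A | not B, not E) P(AC | A) P(VC | A)
p = (1 - p_b) * (1 - p_e) * p_a_not_b_not_e * p_ac_given_a * p_vc_given_a
print(p)   # ~0.000628, i.e. the 0.00062 quoted above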
Example 2
− Suppose we observe the fact that the grass is wet. There are two possible causes for this : either it rained, or the sprinkler was on. Which one is more likely ?
P(S | W) = P(S, W) / P(W) = 0.2781 / 0.6471 = 0.430
P(R | W) = P(R, W) / P(W) = 0.4581 / 0.6471 = 0.708
− We see that it is more likely that the grass is wet because it rained.
Another Bayesian network example : the event of the grass being wet (W = true) has two possible causes, either the water sprinkler was on (S = true) or it rained (R = true); both depend on whether it is cloudy (C). The conditional probability tables are given below :
P(C = T) P(C = F)
0.50 0.50
C P(S = T) P(S = F)
T 0.10 0.90
F 0.50 0.50
C P(R = T) P(R = F)
T 0.80 0.20
F 0.20 0.80
S R P(W = T) P(W = F)
T T 0.99 0.01
T F 0.90 0.10
F T 0.90 0.10
F F 0.00 1.00
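The two posteriors quoted above can be reproduced by brute-force enumeration of the joint distribution defined by the four tables. A minimal sketch (variable and dictionary names are illustrative) :

from itertools import product

p_c = {True: 0.5, False: 0.5}
p_s_given_c = {True: 0.1, False: 0.5}          # P(S = T | C)
p_r_given_c = {True: 0.8, False: 0.2}          # P(R = T | C)
p_w_given_sr = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.00}

def joint(c, s, r, w):
    # P(C) P(S|C) P(R|C) P(W|S,R) for one assignment of the four variables
    pc = p_c[c]
    ps = p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pr = p_r_given_c[c] if r else 1 - p_r_given_c[c]
    pw = p_w_given_sr[(s, r)] if w else 1 - p_w_given_sr[(s, r)]
    return pc * ps * pr * pw

p_w  = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
p_sw = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2))
p_rw = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))

print(round(p_sw / p_w, 3), round(p_rw / p_w, 3))   # ~0.430 and ~0.708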
Applications of Bayesian belief networks include :
1. Machine learning
2. Statistics
3. Computer vision
4. Natural language processing
5. Speech recognition
6. Error-control codes
7. Bioinformatics and medical diagnosis
8. Weather forecasting
o The network topology (nodes and arcs) may be constructed by human experts or inferred from the data.
o The network variables may be observable or hidden in all or some of the training tuples.
− Training the network : if the network topology is known and the variables are observable, then training is straightforward, since the CPT entries can be computed directly from the training data (as is done when computing probabilities in naive Bayesian classification).
Using frequent patterns generated from association rule mining for classification is called associative classification. First, association rules are generated from frequent patterns; these rules are then used for classification.
Steps involved in associative classification :
1. Find frequent itemsets, i.e. commonly occurring attribute–value pairs, in the data.
2. Generate association rules by analysing frequent itemsets as per the class by considering class confidence and
support criteria.
3. Organize the rules to form a rule-based classifier.
5.7.1 CBA
One of the earliest and simplest algorithms for associative classification is CBA (Classification Based on Associations).
Steps in CBA
1. Mine all Class Association Rules (CARs) that satisfy the minimum support and confidence thresholds.
2. Sort the CARs by confidence (and support) in descending order.
3. Select rules in that order to build the classifier, keeping a rule only if it correctly classifies at least one tuple not covered by earlier rules; the majority class of the remaining tuples becomes the default rule.
4. To classify a new tuple, apply the first rule in the list that matches it; if none matches, use the default class.
5.7.2 CMAR
Steps in CMAR
1. Mine for CARs satisfying support and confidence thresholds.
2. Sort all CARs based on confidence.
3. Find all CARs which satisfy the given query
4. Group them based on their class label
5. Classify the query to the class whose group of CARs has the maximum weight.
5.8 Lazy Learners : (or Learning from your Neighbors)
− Eager learners are classification techniques in which a generalized model is first constructed from the given set of training tuples and is then used to classify previously unseen tuples.
− Examples of Eager learners are Decision tree induction, Bayesian Classification, Rule based classification, classification
by backpropagation, support vector machines and classification based on association rule mining.
− In the lazy learner approach, the learner waits until the last minute, i.e. until a test tuple arrives, before doing any model construction to classify it.
− A lazy learner performs generalization only when it sees a test tuple. Until then it simply stores the training tuples or does very little processing.
− Lazy learners do very little work when training tuples are presented and more work when a classification or numeric prediction is to be made.
− Since lazy learners store training tuples, or instances, they are also known as instance-based learners.
3. Able to model complex decision spaces having hyperpolygonal shapes that may not be describable by any other
learning algorithms
4. Two examples of lazy learners are k-nearest-neighbor classifiers and case-based reasoning.
Q. Explain K-nearest neighbor classifier algorithm with suitable application. (Dec. 18, 5 Marks)
− K-Nearest Neighbors (KNN) is widely used in the field of pattern recognition.
− It learns by analogy, i.e. by comparing a given test tuple with training tuples that are similar to it.
− The training tuples have n attributes, so every tuple represents a point in n-dimensional space.
− When an unknown tuple is given, the classifier searches this space for the k training tuples that are closest to it; these k training tuples are the k “nearest neighbors” of the unknown tuple.
− Closeness is defined using distance metrics such as the Euclidean distance.
− The Manhattan (city block) distance or other distance measures may also be used.
− To find the Euclidean distance between two points or tuples
Y1 = {y11, y12, ..., y1n} and Y2 = {y21, y22, ..., y2n},
distance(Y1, Y2) = sqrt( Σ (from i = 1 to n) (y1i – y2i)² )
− KNN classifiers can be extremely slow when classifying test tuples : with n training tuples, a naive search needs O(n) comparisons per test tuple.
− By simple presorting and arranging the stored tuples into search trees, the number of comparisons can be reduced to O(log n).
− Example : if k = 5, the classifier selects the 5 nearest neighbors, as shown in Fig. 5.8.1.
Ex. 5.8.1 : Apply KNN algorithm to find class of new tissue paper (X1 = 3, X2 = 7). Assume K = 3
X1 = Acid Durability (Secs) X2 = Strength (kg/sq. meter) Y = Classification
7 7 Bad
7 4 Bad
3 4 Good
1 4 Good
Soln. :
1. Determine the parameter K = the number of nearest neighbors. Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
The coordinate of the query instance is (3, 7). Instead of the Euclidean distance we compute the squared distance, which is faster to calculate (no square root is needed).
X1 = Acid Durability (seconds) X2 = Strength (kg/square meter) Square distance to query instance (3, 7)
7 7 (7 – 3)² + (7 – 7)² = 16
7 4 (7 – 3)² + (4 – 7)² = 25
3 4 (3 – 3)² + (4 – 7)² = 9
1 4 (1 – 3)² + (4 – 7)² = 13
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
X1 = Acid Durability (seconds) X2 = Strength (kg/square meter) Square distance to query instance (3, 7) Rank of minimum distance Included in 3 nearest neighbors?
4. Gather the category Y of the nearest neighbors. Notice that in the second row, last column, the category of the nearest neighbor (Y) is not included because the rank of this data point is more than 3 (= K).
X1 = Acid Durability (seconds) X2 = Strength (kg/square meter) Square distance to query instance (3, 7) Rank of minimum distance Included in 3 nearest neighbors? Y = Category of nearest neighbor
7 7 (7 – 3)² + (7 – 7)² = 16 3 Yes Bad
7 4 (7 – 3)² + (4 – 7)² = 25 4 No –
3 4 (3 – 3)² + (4 – 7)² = 9 1 Yes Good
1 4 (1 – 3)² + (4 – 7)² = 13 2 Yes Good
5. Use the simple majority of the categories of the nearest neighbors as the predicted value of the query instance.
We have 2 Good and 1 Bad; since 2 > 1, we conclude that a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 belongs to the Good category.
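The whole KNN procedure for this example can be written in a few lines of Python. A minimal sketch using the four training tuples above; the helper names are illustrative :

from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

def sq_distance(a, b):
    # squared Euclidean distance: the square root is not needed for ranking
    return sum((x - y) ** 2 for x, y in zip(a, b))

neighbors = sorted(train, key=lambda t: sq_distance(t[0], query))[:k]
labels = [label for _, label in neighbors]
print(Counter(labels).most_common(1)[0][0])   # prints "Good" (2 Good vs 1 Bad)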
To solve a new problem, a case-based reasoning (CBR) system will typically :
o retrieve the case(s) most similar to the new problem;
o reuse the retrieved case(s) to attempt to solve the problem;
o revise and adapt the proposed solution if necessary;
o retain the final solution as part of a new case.
− Retrieving a case starts with a (possibly partial) problem description and ends when a best matching case has been
found. The subtasks involve :
o identifying a set of relevant problem descriptors;
o matching the case and returning a set of sufficiently similar cases (given a similarity threshold of some kind);
o selecting the best case from the set of cases returned.
− Some systems retrieve cases based largely on superficial syntactic similarities among problem descriptors, while advanced systems use semantic similarities.
− Reusing the retrieved case solution in the context of the new case focuses on: identifying the differences between the
retrieved and the current case; and identifying the part of a retrieved case which can be transferred to the new case.
Generally, the solution of the retrieved case is transferred to the new case directly as its solution case.
− A CBR tool should support the four main processes of CBR: retrieval, reuse, revision and retention. A good tool should
support a variety of retrieval mechanisms and allow them to be mixed when necessary. In addition, the tool should be
able to handle large case libraries with retrieval time increasing linearly (at worst) with the number of cases.
Applications of CBR
Case-based reasoning first appeared in commercial tools in the early 1990s and has since been used to create numerous applications in a wide range of domains :
1. Diagnosis
Case-based diagnosis systems try to retrieve past cases whose symptom lists are similar in nature to that of the new
case and suggest diagnoses based on the best matching retrieved cases. The majority of installed systems are of this
type and there are many medical CBR diagnostic systems.
2. Help Desk
Case-based diagnostic systems are used in the customer service area dealing with handling problems with a product
or service.
3. Assessment
Case-based systems are used to determine the value of a variable by comparing it to the known value of something similar. Assessment tasks are quite common in the finance and marketing domains.
4. Decision support
In decision making, when faced with a complex problem, people often look for analogous problems for possible solutions. CBR systems have been developed to support this problem-retrieval process (often at the level of document retrieval) to find relevant similar problems. CBR is particularly good at querying structured, modular and non-homogeneous documents.
5. Design
Systems to support human designers in architectural and industrial design have been developed. These systems assist the user in only one part of the design process, that of retrieving past cases, and would need to be combined with other forms of reasoning to support the full design process.
Review Questions
Q. 3 Write a pseudo code for the construction of decision tree and also state its time complexity.
Unit VI
Syllabus
Sub sampling and Cross-Validation
6.1 Multiclass Classification
6.1.1 Introduction to Multiclass Classification
One-vs-All (OvA) approach
− Given N classes, train N binary classifiers, one for each class.
− Classifier i is trained using tuples of class i as the positive class and the remaining tuples as the negative class.
− To classify an unknown tuple, X, the set of classifiers vote collectively.
− If classifier i predicts the positive class for X, then class i gets one vote.
− If classifier i predicts the negative class for X, then each of the classes except i gets one vote.
All-vs-All (AvA) approach
− This is an alternative approach that learns a classifier for each pair of classes.
− Given N classes, construct N(N – 1)/2 binary classifiers.
− The class with maximum number of votes is assigned to the unknown tuple X.
− All-vs-All approach is better as compared to One-vs-All.
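The two voting schemes can be sketched as plain functions. Here each binary classifier is assumed to be a callable returning a positive score for its positive class; the function names are illustrative, not a specific library's API :

from collections import Counter

def one_vs_all_predict(x, classifiers):
    # classifiers[i](x) > 0 means "x belongs to class i"
    votes = Counter()
    for i, clf in enumerate(classifiers):
        if clf(x) > 0:
            votes[i] += 1
        else:
            # a negative prediction gives one vote to every other class
            votes.update(j for j in range(len(classifiers)) if j != i)
    return votes.most_common(1)[0][0]

def all_vs_all_predict(x, pairwise):
    # pairwise is a dict {(i, j): clf}; clf(x) > 0 means class i, otherwise class j
    votes = Counter()
    for (i, j), clf in pairwise.items():
        votes[i if clf(x) > 0 else j] += 1
    return votes.most_common(1)[0][0]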
6.2 Semi-Supervised Classification
Fig. 6.2.1 : Semi-Supervised Classification
(a) Self-training
− A classifier is first trained on the labeled data and is then used to predict class labels for the unlabeled data.
− The tuple with the most confident label prediction is added to the labeled data, and the process repeats.
(b) Co-training
− The feature set is split into two sets, and two classifiers f1 and f2 are trained, one on each feature set.
− Then f1 and f2 are used to predict the class labels for the unlabeled data.
− Each classifier then teaches the other in that tuple having the most confident prediction from f1 is added to the
set of labeled data for f2(along with its label).
− The tuple having the most confident prediction from f2 is added to the set of labeled data for f1.
− Co-training is less error-prone compared to self-training.
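Self-training can be expressed as a short loop around any base classifier. The sketch below assumes a scikit-learn style model exposing fit, predict and predict_proba; it is an outline of the idea described above, not the book's algorithm :

import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    # model is assumed to expose scikit-learn style fit / predict / predict_proba
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)
        confidence = model.predict_proba(X_unlab).max(axis=1)
        pick = confidence >= threshold            # keep only very confident predictions
        if not pick.any():
            break
        y_new = model.predict(X_unlab[pick])      # pseudo-labels for the confident tuples
        X_lab = np.vstack([X_lab, X_unlab[pick]])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = X_unlab[~pick]                  # remove them from the unlabeled pool
    return model.fit(X_lab, y_lab)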
6.3 Reinforcement Learning
(SPPU - Dec. 16, May 17, Dec.18)
Q. Briefly explain the reinforcement learning. (Dec. 16, May 17, 6 Marks)
Q. Discuss Reinforcement learning relevance and its applications in real time environment. (Dec. 18, 4 Marks)
− In supervised machine learning the learner is told which action (label) is correct, but in reinforcement learning the learner is not told which action to take; it must discover which actions yield the most reward.
− In this case, the agent has to act and learn through experience.
− All reinforcement learning agents have explicit goals, can sense aspects of their environments, and accordingly select actions to control their environments.
Example
− In a chess game, a player makes a move based on planning : anticipating possible replies and counter-replies. The player then makes an immediate, spontaneous judgment and plays the move.
− In this example the agent (player) uses its experience to evaluate positions and improve its performance.
The main elements of a reinforcement learning system are :
1. A policy
2. A reward function
3. A value function
4. A model of the environment : it is used for planning, to predict the resultant next state and next reward.
− It uses the knowledge acquired so far and, while exploring, each action leads to learning through either rewards or penalties.
− Rewards are related to specific actions, while the value function captures their collective effect.
− To get correct responses, the environment needs to be modelled so that it can accept inputs from changing scenarios and finally produce the optimized value.
− Fig. 6.3.2 shows the typical reinforcement-learning scenario, where actions lead to rewards.
− So for every action, there is an environment function as well as a reinforcement function.
Fig. 6.3.2 : Reinforcement-learning scenario
− Systematic learning considers the complete system, its subsystems and the interactions between them for learning. Based on this it makes decisions.
6.5 Multi-Perspective Decision Making for Big Data and Multi-Perspective Learning
for Big Data
(SPPU - Dec. 15, May 16, Dec. 16, May 17, Dec. 17, Dec. 18)
Q. Write a note on multi-perspective learning. (Dec. 15, May 17, Dec. 17, 4 Marks)
Q. Write short note on multi-perspective decision making. (May 16, 6 Marks)
Q. What is meant by multi perspective decision making? Explain. (Dec. 16, 6 Marks)
Q. Differentiate between wholistic learning and multiperspective learning (Dec. 18, 4 Marks)
− Multi-perspective Learning builds knowledge from various perspectives so that it can be used for decision making
process.
− The perspective includes context, scenario and situation, i.e. the way we look at a particular decision problem.
− In Fig. 6.5.1, P1, P2, P3, ..., Pn refer to the different perspectives in the learning process.
− Each of these perspectives is represented as a function of features.
− The representative feature set should contain all the possible features.
− The relationships may also be represented using a decision tree, as shown in Fig. 6.5.3.
Q. How is the performance of classifier algorithms evaluated ? Discuss in detail. (Dec. 18, 8 Marks)
− Various methods for estimating a classifier’s accuracy are given below. All of them are based on randomly sampled
partitions of data :
o Holdout method
o Random subsampling
o Cross-validation
o Bootstrap
− If we want to compare classifiers to select the best one then the following methods are used :
o Confidence intervals
o Cost-benefit analysis and Receiver Operating Characteristic (ROC) Curves
(SPPU - Dec. 18)
Q. Explain the following measures for evaluating classifier accuracy :
i) Specificity ii) Sensitivity iii) Recall iv) Precision (Dec. 18, 8 Marks)
Accuracy of a classifier M, acc(M) is the percentage of test set tuples that are correctly classified by the model M.
Basic concepts
2. Success : The instance (record) class is predicted correctly.
3. Error : The instance (record) class is predicted incorrectly.
4. Confusion matrix
− It is a useful tool for analyzing how well your classifier can recognize tuples of different classes.
− If we have only a two-class classification problem, then only four classification outcomes are possible; they are given below in the form of a confusion matrix :
Actual class \ Predicted class Yes No Total
Yes TP FN P
No FP TN N
Total P’ N’ All
(i) TP (True Positives) : The positive tuples that were correctly labeled as positive.
(ii) TN (True Negatives) : The negative tuples that were correctly labeled as negative.
(iii) FP (False Positives) : The negative tuples that were incorrectly labeled as positive.
(iv) FN (False Negatives) : The positive tuples that were incorrectly labeled as negative.
(v) P : The number of positive tuples.
(vi) N : The number of negative tuples.
(vii) P’ : The number of tuples that were labeled as positive.
(viii) N’ : The number of tuples that were labeled as negative.
5. Sensitivity : True Positive recognition rate, which is the proportion of positive tuples that are correctly identified.
Sensitivity = TP/P
6. Specificity : True Negative recognition rate which is the proportion of negative tuples that are correctly identified
Specificity = TN/N
7. Classifier accuracy or recognition rate : Percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
OR
Accuracy = (TP + TN) / (P + N)
8. Error rate : The percentage of errors made over the whole set of instances (records) used for testing.
Error rate = 1 – Accuracy
OR
Error rate = (FP + FN) / (P + N)
9. Precision : The percentage of tuples labeled as positive that are actually positive. It is a measure of exactness.
Precision = |TP| / (|TP| + |FP|)
10. Recall : The percentage of positive tuples that the classifier labeled as positive. It is a measure of completeness.
Recall = |TP| / (|TP| + |FN|)
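All of the above measures follow directly from the four confusion-matrix counts. A minimal sketch with made-up counts for illustration :

TP, FN, FP, TN = 90, 10, 30, 70          # hypothetical counts for a two-class problem
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)
error_rate  = (FP + FN) / (P + N)        # = 1 - accuracy
sensitivity = TP / P                     # true positive rate
specificity = TN / N                     # true negative rate
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)

print(f"accuracy={accuracy:.2f} error={error_rate:.2f} "
      f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} recall={recall:.2f}")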
Besides accuracy, classifiers can also be compared with respect to criteria such as :
(i) Speed (ii) Robustness (iii) Scalability (iv) Interpretability
− Re-substitution error rate is a performance measure and is equivalent to training data error rate.
− It is difficult to get 0% error rate but it can be minimized, so low error rate is always preferable.
6.6.2 Holdout
− In the holdout method, the data is divided into a training data set and a testing data set (usually 1/3 for testing, 2/3 for training).
Fig. 6.6.1
− The training data set is used to train the classifier; once the classifier is constructed, the test data set is used to estimate its error rate. The more training data, the better the constructed model; the more test data, the more accurate the error estimate.
− Problem : The samples might not be representative. For example, some classes might be represented with very few
instances or even with no instances at all.
− Solution : stratification, a method which ensures that each class is represented in both the training and the testing data in approximately the same proportion as in the full dataset.
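A stratified holdout split can be written with only the standard library. The sketch below keeps roughly 2/3 of each class for training, as described above; the toy records and names are illustrative :

import random
from collections import defaultdict

def stratified_holdout(records, label_of, train_fraction=2/3, seed=42):
    # split records into train/test while keeping each class's proportion roughly equal
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[label_of(r)].append(r)
    train, test = [], []
    for cls, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * train_fraction))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

data = [(i, "Bus") for i in range(6)] + [(i, "Train") for i in range(6, 10)]
train, test = stratified_holdout(data, label_of=lambda r: r[1])
print(len(train), len(test))   # roughly a 2/3 - 1/3 split per class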
Fig. 6.6.2
Random subsampling
− The holdout method is repeated K times with different random splits of the data.
− For each data split we retrain the classifier from scratch on the training examples and estimate the error Ei on the test examples.
− The overall accuracy (or error) is calculated by averaging the values obtained from the K iterations :
E = (1/K) Σ (from i = 1 to K) Ei
k-fold cross-validation
− Step 1 : The data is split into k subsets of equal size (usually by random sampling).
Fig. 6.6.3
− Step 2 : Each subset in turn is used for testing and the remainder for training.
− The advantage is that all the examples are used for both training and testing.
− The k error estimates are averaged to yield an overall estimate :
E = (1/K) Σ (from i = 1 to K) Ei
Leave-one-out cross validation
− If the dataset has N examples, then N experiments are performed for leave-one-out cross-validation.
− For every experiment, N – 1 examples are used for training and the remaining example for testing.
− The average error rate on test examples gives the true error.
E = (1/N) Σ (from i = 1 to N) Ei
− Stratified cross-validation : Subsets are stratified before the cross-validation is performed.
− Repeated ten-fold cross-validation : ten-fold cross-validation is repeated ten times and the results are finally averaged over the 10 runs.
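k-fold cross-validation is essentially a loop over k train/test splits. In the sketch below the per-fold error is supplied by the caller through an evaluate function, so the dummy evaluator in the usage line is only an illustration :

def k_fold_cross_validation(records, k, evaluate):
    # evaluate(train, test) must return that fold's error Ei
    folds = [records[i::k] for i in range(k)]          # simple round-robin split
    errors = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        errors.append(evaluate(train, test))
    return sum(errors) / k                             # E = (1/K) * sum(Ei)

# illustrative use with a dummy evaluator that reports a 10% error on every fold
records = list(range(100))
print(k_fold_cross_validation(records, k=10, evaluate=lambda tr, te: 0.1))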
Fig. 6.6.4
Q. 1 For each of the following queries, identify and write the type of data mining task.
i) Find all credit applicants who are poor credit risks.
ii) Identify customers with similar buying habits.
iii) Find all items which are frequently purchased with milk. (Dec. 15, 6 Marks)
Ans. :
i) Classification : Based on credit applications, customers can be classified into various classes like poor, medium and high credit-risk customers.
ii) Clustering : Clusters can be formed based on similar buying patterns. Then the customers belonging to those clusters can be identified.
iii) Association : Various items that are frequently purchased with milk can be identified with the association data mining task. Based on support and confidence, milk can be associated with those frequent items.
Q. 2 Consider the following training data set, with attributes Income and Credit, a class label h, and an associated data point x :
Sr. No. Income Credit Class Data point
1 4 Excellent h1 X4
2 3 Good h1 X7
3 2 Excellent h1 X2
4 3 Good h1 X7
5 4 Good h1 X8
6 2 Excellent h1 X2
7 3 Bad h2 X11
8 2 Bad h2 X10
9 3 Bad h3 X11
10 1 Bad h4 X9
Calculate the prior probabilities of each of the class h1, h2, h3, h4 and probabilities for data points X1, X4, X7 and X8,
belonging to the class h1. (Dec. 15, 8 Marks)
Ans. :
Assign ten data values for all combinations of credit and income :
Credit \ Income 1 2 3
Excellent x1 x3 x6
Good x2 x4 x5
Bad x7 x8/9 x10
P(h1) = 6/10 = 60 %
P(h2) = 2/10 = 20 %
P(h3) = 1/10 = 10 %
P(h4) = 1/10 = 10 %
Q. 3 What are similarities and differences between reinforcement learning and artificial intelligence algorithms ?
(Dec. 16, 5 Marks)
Ans. :
Sr. No. | Reinforcement Learning | Artificial Intelligence
1. | Reinforcement learning is a branch of Artificial Intelligence and a type of machine learning algorithm. | Artificial Intelligence (AI) is an area of computer science that emphasizes the creation of intelligent machines.
2. | To maximize its performance, it allows software agents and machines to automatically determine the ideal behaviour within a specific context. | One of the fundamental building blocks of AI solutions is learning; conceptually, learning is a process that improves the knowledge of an AI program by making observations about its environment.
3. | Applications : Manufacturing, inventory management. | Applications : Expert systems, speech recognition and machine vision.
The similarity between reinforcement learning and systematic machine learning is that both adapt to an evolving environment.
Review Questions