Data Mining Algorithms
SUSHIL KULKARNI
CLASSIFICATION PROBLEM
Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
This actually divides D into equivalence classes.
Prediction is similar, but may be viewed as having an infinite number of classes.
CLASSIFICATION EXAMPLES
Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition
CLASSIFICATION EXAMPLE: MARKS
(Decision tree on marks x: x >= 90 gives A, 80 <= x < 90 gives B, 70 <= x < 80 gives C, 60 <= x < 70 gives D.)
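As a minimal sketch, the same thresholds can be written as a small Python function; mapping anything below 60 to F is an assumption taken from the grade list on the previous slide, not from the figure itself.

def grade(x):
    # Assign a letter grade from a numeric mark, mirroring the tree above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    return "F"   # below 60: assumed from the A/B/C/D/F list above

print([grade(m) for m in (95, 83, 71, 64, 40)])   # ['A', 'B', 'C', 'D', 'F']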
CLASSIFICATION EXAMPLE: LETTER RECOGNITION
View letters as constructed from 5 components (figure: letters A through F).
CLASSIFICATION TECHNIQUES
Regression
Distance
Decision Trees
Rules
Neural Networks
CLASSIFICATION TECHNIQUES
Approach:
Create a specific model by evaluating training data (or using domain experts' knowledge).
Apply the model developed to new data.
Classes must be predefined.
Most common techniques use DTs, NNs, or are based on distances or statistical methods.
DEFINE CLASSES
Distance Based
Partitioning Based
ISSUES IN CLASSIFICATION
Missing Data
1. Ignore
2. Replace with assumed value
Measuring Performance
1. Classification accuracy on test data
2. Confusion matrix
3. OC Curve
PERFORMANCE MEASURE
MEASURING PERFORMANCE IN CLASSIFICATION
Cj is a specific class and ti is a database tuple, which may or may not be assigned to that class, while its actual membership may or may not be in that class. This gives four cases, as shown below:
True Positive
False Negative
False Positive
True Negative
OPERATING CHARACTERISTIC CURVE
It shows the relationship between false positives and true positives.
The OC curve was originally used to examine false alarm rates.
CONFUSION MATRIX
It illustrates the accuracy of a solution to a classification problem.
Definition: Given m classes, a confusion matrix is an m by m matrix where each entry indicates the number of tuples from D that were assigned to class Cj but whose correct class is Ci.
CONFUSION MATRIX EXAMPLE
Using the height data example, with Output 1 the correct assignment and Output 2 the actual assignment.
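A minimal sketch of building such a matrix in Python. The short/medium/tall labels only stand in for the height data; the actual table from the slide is not reproduced here.

def confusion_matrix(actual, predicted, classes):
    # Entry [i][j] counts tuples whose correct class is classes[i]
    # but which were assigned to classes[j].
    idx = {c: k for k, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        m[idx[a]][idx[p]] += 1
    return m

actual    = ["short", "medium", "tall", "medium", "tall"]   # invented labels
predicted = ["short", "tall",   "tall", "medium", "medium"]
for row in confusion_matrix(actual, predicted, ["short", "medium", "tall"]):
    print(row)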
STATISTICAL BASED ALGORITHMS
REGRESSION
Assume the data fits a predefined function.
Determine the best values for the regression coefficients c0, c1, ..., cn.
Linear Regression: y = c0 + c1x1 + ... + cnxn
Assume an error: y = c0 + c1x1 + ... + cnxn + e
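A small sketch of finding c0 and c1 by least squares for a single input variable; the data points are invented purely for illustration, and NumPy's lstsq is used as the solver.

import numpy as np

# Fit y = c0 + c1*x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # [c0, c1]
print("c0 = %.3f, c1 = %.3f" % (coef[0], coef[1]))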
CLASSIFICATION USING REGRESSION
Division: use the regression function to divide the area into regions.
Prediction: use the regression function to predict a class membership function. Input includes the desired class.
DIVISION (figure)
PREDICTION (figure)
CORRELATION
Examine the degree to which the values for two variables behave similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
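A short Python sketch of the correlation coefficient r; the two small sequences are invented just to show the extreme values 1 and -1.

import math

def correlation(xs, ys):
    # Pearson correlation coefficient r in [-1, 1].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0: perfect correlation
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))    # -1.0: perfect but opposite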
BAYES THEOREM
Posterior Probability: P(h1 | xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
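As a tiny numeric illustration of the theorem; the probability values below are invented and are not taken from the training-data example that follows.

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    # Bayes theorem: P(h | x) = P(x | h) * P(h) / P(x).
    return likelihood_x_given_h * prior_h / evidence_x

p_h = 0.3            # prior P(h1), made up
p_x_given_h = 0.8    # likelihood P(xi | h1), made up
p_x = 0.5            # evidence P(xi), made up
print(posterior(p_h, p_x_given_h, p_x))   # 0.48 = P(h1 | xi)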
Bayes Example (contd.)
Training Data:
DISTANCE BASED ALGORITHMS
SIMILARITY MEASURES
Determine similarity between two objects.
Similarity characteristics:
Sim(ti, ti) = 1 : complete similarity
Sim(ti, tj) = 0 : no similarity
CLASSIFICATION USING DISTANCE
Place items in the class to which they are closest.
Must determine the distance between an item and a class.
DISTANCE MEASURES
Measure dissimilarity between objects.
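Two common dissimilarity measures as a short Python sketch; Euclidean and Manhattan distance are shown as generic examples, since the specific measures on the original slide are not reproduced here.

import math

def euclidean(a, b):
    # Euclidean distance between two tuples of numeric attributes.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Manhattan (city-block) distance.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((1, 2), (4, 6)))   # 5.0
print(manhattan((1, 2), (4, 6)))   # 7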
CLASSIFICATION USING DISTANCE
Classes represented by:
1. Centroid: central value.
2. Medoid: representative point.
3. Individual points
Algorithm: KNN
KNN ALGORITHM
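A minimal KNN sketch: the k nearest training tuples (by Euclidean distance) vote on the class of a new item. The tiny height-style training set below is invented for illustration.

import math
from collections import Counter

def knn_classify(item, training, k=3):
    # training: list of (tuple, class_label) pairs.
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(training, key=lambda tc: dist(item, tc[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]          # majority class among the k nearest

training = [((1.5,), "short"), ((1.6,), "short"), ((1.8,), "tall"), ((1.9,), "tall")]
print(knn_classify((1.85,), training, k=3))    # 'tall'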
DECISION TREE
Tree where the root and each internal node is labeled with a question.
The arcs represent each possible answer to the associated question.
Each leaf node represents a prediction of a solution to the problem.
DECISION TREE
Popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.
DECISION TREE
Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}
A Decision or Classification Tree is a tree associated with D such that:
Each internal node is labeled with an attribute, Ai
Each arc is labeled with a predicate which can be applied to the attribute at its parent
Each leaf node is labeled with a class, Cj
DECISION TREE: ADVANTAGES
Easy to understand.
Easy to generate rules.
DECISION TREE: DISADVANTAGES
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric data.
Can be quite large, so pruning is necessary.
CLASSIFICATION USING DECISION TREE
Partitioning based: divide the search space into rectangular regions.
Tuple placed into a class based on the region within which it falls.
DT approaches differ in how the tree is built: DT Induction
Internal nodes are associated with an attribute and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART
DT SPLIT AREA (figure: the space is split on Gender (M/F), then on Height)
COMPARING DTs (figure: a balanced tree vs. a deep tree)
DT ISSUES
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data
INFORMATION
DT INDUCTION
When all the marbles in the bowl are mixed up, little information is given.
When the marbles in the bowl are all from one class, and those in the other two classes are on either side, more information is given.
Use this approach with DT Induction!
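A small sketch of the underlying measure, Shannon entropy: a bowl with the classes all mixed up has high entropy (little information), a single-class bowl has zero entropy. Induction algorithms such as ID3 pick the split that reduces this quantity the most; the marble labels below are invented.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: high when classes are mixed, 0 when one class dominates.
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

mixed     = ["A", "B", "C", "A", "B", "C"]   # marbles all mixed up
one_class = ["A", "A", "A", "A", "A", "A"]   # marbles all from one class
print(entropy(mixed))       # ~1.585 bits: little is known about the class
print(entropy(one_class))   # 0.0 bits: the class is fully determined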
ARTIFICIAL NEURAL NETWORK (ANN)
An ANN is an information processing paradigm inspired by the way the brain processes information.
Composed of a large number of highly interconnected processing elements called neurons.
ANNs, like people, learn by example.
Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons.
This is true of ANNs as well.
A SIMPLE NEURON (figure)
NEURAL NETWORKS
Based on the observed functioning of the human brain (Artificial Neural Networks, ANN).
The first artificial neuron was produced in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.
Our view of neural networks is very simplistic.
We view a neural network (NN) from a graphical viewpoint.
Used in pattern recognition, speech recognition, computer vision, and classification.
NEURAL NETWORKS
It is a directed graph F = <V, A> with vertices V = {1, 2, ..., n} and arcs A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output nodes, VO.
Any arc <i, j> must have node i in layer h-1 and node j in layer h.
Arc <i, j> is labeled with a numeric value wij.
Node i is labeled with a function fi.
NN NODE (figure)
NN: Advantages
Learning
Can continue learning even after the training set has been applied.
Easy parallelization
Solves many problems
NN: Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a priori.
Input values must be numeric.
Verification difficult.
CLASSIFICATION USING NEURAL NETWORKS
Typical NN structure for classification:
1. One output node per class
2. Output value is class membership function value
Supervised learning
For each tuple in the training set, propagate it through the NN. Adjust weights on edges to improve future classification.
Algorithms: Propagation, Backpropagation, Gradient Descent
NN ISSUES
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections
Weights
Activation Functions
Learning Technique
When to stop learning
PROPAGATION (figure: tuple input propagated to output)
NN PROPAGATION ALGORITHM
EXAMPLE: PROPAGATION
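A minimal sketch of propagation through a feed-forward NN: each node applies its activation function to the weighted sum of its inputs, layer by layer. The sigmoid activation and the weight values below are assumptions for illustration, not the values of the original example.

import math

def forward(inputs, layers):
    # layers: list of weight matrices; layers[h][j] holds the weights into node j
    # of layer h from every node of the previous layer.
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    values = list(inputs)
    for weights in layers:
        values = [sigmoid(sum(w * v for w, v in zip(row, values))) for row in weights]
    return values

# 2 inputs -> 2 hidden nodes -> 1 output node (weights are made up)
hidden = [[0.5, -0.4], [0.3, 0.8]]
output = [[1.0, -1.0]]
print(forward([1.0, 0.0], [hidden, output]))   # single output value near 0.5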
RULES
1R ALGORITHM
1R EXAMPLE
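A minimal sketch of the 1R idea: for each attribute, map each of its values to the majority class, and keep the attribute whose rules make the fewest errors on the training data. The weather-style rows below are invented toy data, not the original example.

from collections import Counter, defaultdict

def one_r(rows, attributes, class_idx):
    # Returns (best attribute index, {value: class} rules, training error count).
    best = None
    for a in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[a]][row[class_idx]] += 1
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        errors = sum(sum(cnt.values()) - max(cnt.values()) for cnt in by_value.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

rows = [("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"), ("overcast", "yes", "yes")]
print(one_r(rows, attributes=[0, 1], class_idx=2))   # picks attribute 1 (windy)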
PRISM ALGORITHM
PRISM EXAMPLE
Rules have no ordering of predicates.
Only need to look at one class to generate its rules.
Clustering Examples
Segment customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.
CLUSTERING
CLUSTERING: Example (figure)
CLUSTERING HOUSES (figure: the same houses grouped by geographic distance vs. grouped by size)
Unsupervised learning
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering (figure)
Clustering Problem
Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
A cluster, Kj, contains precisely those tuples mapped to it.
Unlike the classification problem, the clusters are not known a priori.
Types of Clustering
Hierarchical: nested set of clusters created.
Partitional: one set of clusters created.
Incremental: each element handled one at a time.
Simultaneous: all elements handled together.
Overlapping / Non-overlapping
Clustering Approaches
Clustering → Hierarchical (Agglomerative, Divisive), Partitional, Categorical, Large DB (Sampling, Compression)
Cluster Parameters
Hierarchical Clustering
Clusters are created in levels, actually creating sets of clusters at each level.
Agglomerative:
Initially each item is in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive:
Initially all items are in one cluster
Large clusters are successively divided
Top Down
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
Leaf: individual clusters
Root: one cluster
Agglomerative Example (figure: distance matrix for items A-E and the dendrogram formed as the threshold grows from 1 to 5)
MST Example (figure: minimum spanning tree over items A-E)
Agglomerative Algorithm
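A minimal single-link agglomerative sketch for 1-D items: start with singleton clusters and merge any two clusters whose closest members fall within the current threshold (one level of the hierarchy). The points and threshold below are invented.

def single_link_agglomerative(points, threshold):
    clusters = [[p] for p in points]                # start: each item in its own cluster
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if d <= threshold:                  # at least one connecting edge
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

print(single_link_agglomerative([1, 2, 4, 8, 9], threshold=1))   # [[1, 2], [4], [8, 9]]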
Single Link
View all items with links (distances) between them.
Finds maximal connected components in this graph.
Two clusters are merged if there is at least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed to several steps.
Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
Usually deals with static sets.
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters until the desired set is reached.
A high degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim)
K-Means Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
Randomly assign means: m1 = 3, m2 = 4
K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}, m1 = 2.5, m2 = 16
K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}, m1 = 3, m2 = 18
K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}, m1 = 4.75, m2 = 19.6
K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}, m1 = 7, m2 = 25
Stop, as the clusters with these means are the same.
K-Means Algorithm
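A minimal K-Means sketch that reproduces the worked example above: assign each point to the nearest mean, recompute the means, and stop when they no longer change (the sketch assumes every cluster stays non-empty, which holds for this data).

def k_means(points, means):
    while True:
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:          # clusters (and means) no longer change
            return clusters, means
        means = new_means

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means(points, means=[3, 4])
print([sorted(c) for c in clusters], means)   # [[2, 3, 4, 10, 11, 12], [20, 25, 30]] [7.0, 25.0]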
Nearest Neighbor
Items are iteratively merged into the existing clusters that are closest.
Incremental
A threshold, t, is used to determine if items are added to existing clusters or a new cluster is created.
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree
Clustering Feature
CF Triple: (N, LS, SS)
N: number of points in the cluster
LS: sum of the points in the cluster
SS: sum of squares of the points in the cluster
CF Tree
Balanced search tree
Node has a CF triple for each child
Leaf node represents a cluster and has a CF value for each subcluster in it.
Each subcluster has a maximum diameter.
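A small sketch of the CF triple for 1-D points; because two CFs combine by component-wise addition, clusters can be merged without revisiting the raw points, which is what makes BIRCH incremental. The points below are invented.

def clustering_feature(points):
    # CF triple (N, LS, SS) summarizing one cluster of 1-D points.
    n = len(points)
    ls = sum(points)
    ss = sum(p * p for p in points)
    return (n, ls, ss)

def merge(cf1, cf2):
    # Merging two clusters is just adding their CF triples.
    return tuple(a + b for a, b in zip(cf1, cf2))

print(clustering_feature([1, 2, 3]))               # (3, 6, 14)
print(merge((3, 6, 14), clustering_feature([4])))  # (4, 10, 30)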
CURE
Clustering Using Representatives
Use many points to represent a cluster instead of only one.
Points will be well scattered.
CURE Approach
CURE Algorithm
ASSOCIATION RULES
Comparing Techniques
Incremental Algorithm
Advanced AR Techniques
Uses:
Placement
Advertising
Sales
Coupons
Apriori
Large Itemset Property: any subset of a large itemset is large.
Contrapositive: if an itemset is not large, none of its supersets are large.
Apriori Example (contd.): s = 30%, α = 50%
Apriori Algorithm
1. C1 = itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5.   i = i + 1;
6.   Ci = Apriori-Gen(Li-1);
7.   Count Ci to determine Li;
8. until no more large itemsets are found;
Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i.
Approach used: join large itemsets of size i if they agree on the first i-1 items.
May also prune candidates which have subsets that are not large.
Apriori-Gen Example
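A minimal sketch of the join-and-prune step, run on the Bread / Jelly / PeanutButter itemsets used elsewhere in these slides; the helper name apriori_gen and the exact data layout are our own assumptions, not the original worked example.

from itertools import combinations

def apriori_gen(large_itemsets):
    # Join large itemsets of size i that agree on their first i-1 items,
    # then prune candidates with any i-subset that is not large.
    large = [tuple(sorted(s)) for s in large_itemsets]
    i = len(large[0])
    candidates = set()
    for a in large:
        for b in large:
            if a[:i - 1] == b[:i - 1] and a[i - 1] < b[i - 1]:
                c = a + (b[i - 1],)
                if all(s in large for s in combinations(c, i)):
                    candidates.add(c)
    return candidates

L2 = [("Bread", "Jelly"), ("Bread", "PeanutButter"), ("Jelly", "PeanutButter")]
print(apriori_gen(L2))   # {('Bread', 'Jelly', 'PeanutButter')}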
Apriori Adv/Disadv
Advantages:
Uses the large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes the transaction database is memory resident.
Requires up to m database scans.
Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): large itemsets from the sample.
Negative Border (BD-):
Generalization of Apriori-Gen applied to itemsets of varying sizes.
Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
(Figure: PL vs. PL ∪ BD-(PL))
Sampling Algorithm
1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls;
3. C = PL ∪ BD-(PL);
4. Count C in database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done;
7. else C = repeated application of BD-;
8. Count C in database;
Sampling Example
Find AR assuming s = 20%
Ds = {t1, t2}
smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
BD-(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD- generates all remaining itemsets.
Sampling Adv/Disadv
Advantages:
Reduces the number of database scans to one in the best case and two in the worst.
Scales better.
Disadvantages:
Potentially large number of candidates in the second pass.
Partitioning
Divide the database into partitions D1, D2, ..., Dp
Apply Apriori to each partition.
Any large itemset must be large in at least one partition.
Partitioning Algorithm
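A small sketch of the partitioning approach: locally large itemsets from each partition become the candidates, which are then counted in one pass over the whole database. The brute-force locally_large helper stands in for a real Apriori pass on each partition, and the tiny transactions are invented.

from itertools import combinations

def locally_large(transactions, min_support):
    # Brute-force stand-in for Apriori on one partition.
    items = sorted({i for t in transactions for i in t})
    large = set()
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(1 for t in transactions if set(cand) <= t) >= min_support * len(transactions):
                large.add(cand)
    return large

def partitioned_large(partitions, min_support):
    # Union of locally large itemsets, then one counting pass over the full database;
    # a globally large itemset must be large in at least one partition.
    candidates = set().union(*(locally_large(p, min_support) for p in partitions))
    db = [t for p in partitions for t in p]
    return {c for c in candidates
            if sum(1 for t in db if set(c) <= t) >= min_support * len(db)}

D1 = [{"Bread", "Jelly"}, {"Bread", "PeanutButter"}]
D2 = [{"Bread", "Milk"}, {"Beer", "Milk"}]
print(partitioned_large([D1, D2], min_support=0.5))   # {('Bread',), ('Milk',)} (set order may vary)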
Partitioning Example (s = 10%)
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}
Partitioning Adv/Disadv
Advantages:
Adapts to available main memory
Easily parallelized
Maximum number of database scans is two.
Disadvantages:
May have many candidates during the second scan.
Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
What is counted at each site
How data (transactions) are distributed
Data Parallelism:
Data partitioned
Count Distribution Algorithm
Task Parallelism:
Data and candidates partitioned
Data Distribution Algorithm
THANKS!