Data Mining Algorithms
SUSHIL KULKARNI
CLASSIFICATION PROBLEM
Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
This actually divides D into equivalence classes.
Prediction is similar, but may be viewed as having an infinite number of classes.
CLASSIFICATION EXAMPLES
Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition
CLASSIFICATION EXAMPLE: MARKS
(Decision tree on marks x: x >= 90 gives A, 80 <= x < 90 gives B, 70 <= x < 80 gives C, 60 <= x < 70 gives D.)
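As a minimal sketch, the same thresholds can be written as a small Python function; mapping anything below 60 to F is an assumption taken from the grade list on the previous slide, not from the figure itself.

def grade(x):
    # Assign a letter grade from a numeric mark, mirroring the tree above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    return "F"   # below 60: assumed from the A/B/C/D/F list above

print([grade(m) for m in (95, 83, 71, 64, 40)])   # ['A', 'B', 'C', 'D', 'F']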
CLASSIFICATION EXAMPLE: LETTER RECOGNITION
View letters as constructed from 5 components (figure: letters A through F).
CLASSIFICATION TECHNIQUES
Regression
Distance
Decision Trees
Rules
Neural Networks
CLASSIFICATION TECHNIQUES
Approach:
Create a specific model by evaluating training data (or using domain experts' knowledge).
Apply the model developed to new data.
Classes must be predefined.
Most common techniques use DTs, NNs, or are based on distances or statistical methods.
DEFINE CLASSES
Distance Based
Partitioning Based
ISSUES IN CLASSIFICATION
Missing Data
1. Ignore
2. Replace with assumed value
Measuring Performance
1. Classification accuracy on test data
2. Confusion matrix
3. OC Curve
PERFORMANCE MEASURE
MEASURING PERFORMANCE IN CLASSIFICATION
Cj is a specific class and ti is a database tuple, which may or may not be assigned to that class, while its actual membership may or may not be in that class. This gives four cases, as shown below:
True Positive
False Negative
False Positive
True Negative
OPERATING CHARACTERISTIC CURVE
It shows the relationship between false positives and true positives.
The OC curve was originally used to examine false alarm rates.
CONFUSION MATRIX
It illustrates the accuracy of a solution to a classification problem.
Definition: Given m classes, a confusion matrix is an m by m matrix where each entry indicates the number of tuples from D that were assigned to class Cj but whose correct class is Ci.
CONFUSION MATRIX EXAMPLE
Using the height data example, with Output 1 the correct assignment and Output 2 the actual assignment.
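A minimal sketch of building such a matrix in Python. The short/medium/tall labels only stand in for the height data; the actual table from the slide is not reproduced here.

def confusion_matrix(actual, predicted, classes):
    # Entry [i][j] counts tuples whose correct class is classes[i]
    # but which were assigned to classes[j].
    idx = {c: k for k, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        m[idx[a]][idx[p]] += 1
    return m

actual    = ["short", "medium", "tall", "medium", "tall"]   # invented labels
predicted = ["short", "tall",   "tall", "medium", "medium"]
for row in confusion_matrix(actual, predicted, ["short", "medium", "tall"]):
    print(row)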
STATISTICAL BASED ALGORITHMS
REGRESSION
Assume the data fits a predefined function.
Determine the best values for the regression coefficients c0, c1, ..., cn.
Linear Regression: y = c0 + c1x1 + ... + cnxn
Assume an error: y = c0 + c1x1 + ... + cnxn + e
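A small sketch of finding c0 and c1 by least squares for a single input variable; the data points are invented purely for illustration, and NumPy's lstsq is used as the solver.

import numpy as np

# Fit y = c0 + c1*x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # [c0, c1]
print("c0 = %.3f, c1 = %.3f" % (coef[0], coef[1]))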
CLASSIFICATION USING REGRESSION
Division: use the regression function to divide the area into regions.
Prediction: use the regression function to predict a class membership function. Input includes the desired class.
DIVISION (figure)
PREDICTION (figure)
CORRELATION
Examine the degree to which the values for two variables behave similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
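A short Python sketch of the correlation coefficient r; the two small sequences are invented just to show the extreme values 1 and -1.

import math

def correlation(xs, ys):
    # Pearson correlation coefficient r in [-1, 1].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0: perfect correlation
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))    # -1.0: perfect but opposite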
BAYES THEOREM
Posterior Probability: P(h1 | xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
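As a tiny numeric illustration of the theorem; the probability values below are invented and are not taken from the training-data example that follows.

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    # Bayes theorem: P(h | x) = P(x | h) * P(h) / P(x).
    return likelihood_x_given_h * prior_h / evidence_x

p_h = 0.3            # prior P(h1), made up
p_x_given_h = 0.8    # likelihood P(xi | h1), made up
p_x = 0.5            # evidence P(xi), made up
print(posterior(p_h, p_x_given_h, p_x))   # 0.48 = P(h1 | xi)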
Bayes Example (contd.)
Training Data:
DISTANCE BASED ALGORITHMS
SIMILARITY MEASURES
Determine similarity between two objects.
Similarity characteristics:
Sim(ti, ti) = 1 : complete similarity
Sim(ti, tj) = 0 : no similarity
CLASSIFICATION USING DISTANCE
Place items in the class to which they are closest.
Must determine the distance between an item and a class.
DISTANCE MEASURES
Measure dissimilarity between objects.
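Two common dissimilarity measures as a short Python sketch; Euclidean and Manhattan distance are shown as generic examples, since the specific measures on the original slide are not reproduced here.

import math

def euclidean(a, b):
    # Euclidean distance between two tuples of numeric attributes.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Manhattan (city-block) distance.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((1, 2), (4, 6)))   # 5.0
print(manhattan((1, 2), (4, 6)))   # 7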
CLASSIFICATION USING DISTANCE
Classes represented by:
1. Centroid: central value.
2. Medoid: representative point.
3. Individual points
Algorithm: KNN
KNN ALGORITHM
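A minimal KNN sketch: the k nearest training tuples (by Euclidean distance) vote on the class of a new item. The tiny height-style training set below is invented for illustration.

import math
from collections import Counter

def knn_classify(item, training, k=3):
    # training: list of (tuple, class_label) pairs.
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(training, key=lambda tc: dist(item, tc[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]          # majority class among the k nearest

training = [((1.5,), "short"), ((1.6,), "short"), ((1.8,), "tall"), ((1.9,), "tall")]
print(knn_classify((1.85,), training, k=3))    # 'tall'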
DECISION TREE
Tree where the root and each internal node is labeled with a question.
The arcs represent each possible answer to the associated question.
Each leaf node represents a prediction of a solution to the problem.
DECISION TREE
Popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.
DECISION TREE
Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}
A Decision or Classification Tree is a tree associated with D such that:
Each internal node is labeled with an attribute, Ai
Each arc is labeled with a predicate which can be applied to the attribute at its parent
Each leaf node is labeled with a class, Cj
DECISION TREE: ADVANTAGES
Easy to understand.
Easy to generate rules.
DECISION TREE: DISADVANTAGES
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric data.
Can be quite large, so pruning is necessary.
CLASSIFICATION USING DECISION TREE
Partitioning based: divide the search space into rectangular regions.
Tuple placed into a class based on the region within which it falls.
DT approaches differ in how the tree is built: DT Induction
Internal nodes are associated with an attribute and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART
DT SPLIT AREA (figure: the space is split on Gender (M/F), then on Height)
COMPARING DTs (figure: a balanced tree vs. a deep tree)
DT ISSUES
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data
INFORMATION
DT INDUCTION
When all the marbles in the bowl are mixed up, little information is given.
When the marbles in the bowl are all from one class, and those in the other two classes are on either side, more information is given.
Use this approach with DT Induction!
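A small sketch of the underlying measure, Shannon entropy: a bowl with the classes all mixed up has high entropy (little information), a single-class bowl has zero entropy. Induction algorithms such as ID3 pick the split that reduces this quantity the most; the marble labels below are invented.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: high when classes are mixed, 0 when one class dominates.
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

mixed     = ["A", "B", "C", "A", "B", "C"]   # marbles all mixed up
one_class = ["A", "A", "A", "A", "A", "A"]   # marbles all from one class
print(entropy(mixed))       # ~1.585 bits: little is known about the class
print(entropy(one_class))   # 0.0 bits: the class is fully determined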
ARTIFICIAL NEURAL NETWORK (ANN)
An ANN is an information processing paradigm inspired by the way the brain processes information.
Composed of a large number of highly interconnected processing elements called neurons.
ANNs, like people, learn by example.
Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons.
This is true of ANNs as well.
A SIMPLE NEURON (figure)
NEURAL NETWORKS
Based on the observed functioning of the human brain (Artificial Neural Networks, ANN).
The first artificial neuron was produced in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.
Our view of neural networks is very simplistic.
We view a neural network (NN) from a graphical viewpoint.
Used in pattern recognition, speech recognition, computer vision, and classification.
NEURAL NETWORKS
It is a directed graph F = <V, A> with vertices V = {1, 2, ..., n} and arcs A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output nodes, VO.
Any arc <i, j> must have node i in layer h-1 and node j in layer h.
Arc <i, j> is labeled with a numeric value wij.
Node i is labeled with a function fi.
NN NODE (figure)
NN: Advantages
Learning
Can continue learning even after the training set has been applied.
Easy parallelization
Solves many problems
NN: Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a priori.
Input values must be numeric.
Verification difficult.
CLASSIFICATION USING NEURAL NETWORKS
Typical NN structure for classification:
1. One output node per class
2. Output value is class membership function value
Supervised learning
For each tuple in the training set, propagate it through the NN. Adjust weights on edges to improve future classification.
Algorithms: Propagation, Backpropagation, Gradient Descent
NN ISSUES
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections
Weights
Activation Functions
Learning Technique
When to stop learning
PROPAGATION (figure: tuple input propagated to output)
NN PROPAGATION ALGORITHM
EXAMPLE: PROPAGATION
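A minimal sketch of propagation through a feed-forward NN: each node applies its activation function to the weighted sum of its inputs, layer by layer. The sigmoid activation and the weight values below are assumptions for illustration, not the values of the original example.

import math

def forward(inputs, layers):
    # layers: list of weight matrices; layers[h][j] holds the weights into node j
    # of layer h from every node of the previous layer.
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    values = list(inputs)
    for weights in layers:
        values = [sigmoid(sum(w * v for w, v in zip(row, values))) for row in weights]
    return values

# 2 inputs -> 2 hidden nodes -> 1 output node (weights are made up)
hidden = [[0.5, -0.4], [0.3, 0.8]]
output = [[1.0, -1.0]]
print(forward([1.0, 0.0], [hidden, output]))   # single output value near 0.5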
RULES
1R ALGORITHM
1R EXAMPLE
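A minimal sketch of the 1R idea: for each attribute, map each of its values to the majority class, and keep the attribute whose rules make the fewest errors on the training data. The weather-style rows below are invented toy data, not the original example.

from collections import Counter, defaultdict

def one_r(rows, attributes, class_idx):
    # Returns (best attribute index, {value: class} rules, training error count).
    best = None
    for a in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[a]][row[class_idx]] += 1
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        errors = sum(sum(cnt.values()) - max(cnt.values()) for cnt in by_value.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

rows = [("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"), ("overcast", "yes", "yes")]
print(one_r(rows, attributes=[0, 1], class_idx=2))   # picks attribute 1 (windy)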
PRISM ALGORITHM
PRISM EXAMPLE
Rules have no ordering of predicates.
Only need to look at one class to generate its rules.
Clustering Examples
Segment customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.
CLUSTERING
CLUSTERING: Example (figure)
CLUSTERING HOUSES (figure: the same houses grouped by geographic distance vs. grouped by size)
Unsupervised learning
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering (figure)
Clustering Problem
Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
A cluster, Kj, contains precisely those tuples mapped to it.
Unlike the classification problem, the clusters are not known a priori.
Types of Clustering
Hierarchical: nested set of clusters created.
Partitional: one set of clusters created.
Incremental: each element handled one at a time.
Simultaneous: all elements handled together.
Overlapping / Non-overlapping
Clustering Approaches
Clustering → Hierarchical (Agglomerative, Divisive), Partitional, Categorical, Large DB (Sampling, Compression)
Cluster Parameters
Hierarchical Clustering
Clusters are created in levels, actually creating sets of clusters at each level.
Agglomerative:
Initially each item is in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive:
Initially all items are in one cluster
Large clusters are successively divided
Top Down
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
Leaf: individual clusters
Root: one cluster
Agglomerative Example (figure: distance matrix for items A-E and the dendrogram formed as the threshold grows from 1 to 5)
MST Example (figure: minimum spanning tree over items A-E)
Agglomerative Algorithm
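A minimal single-link agglomerative sketch for 1-D items: start with singleton clusters and merge any two clusters whose closest members fall within the current threshold (one level of the hierarchy). The points and threshold below are invented.

def single_link_agglomerative(points, threshold):
    clusters = [[p] for p in points]                # start: each item in its own cluster
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if d <= threshold:                  # at least one connecting edge
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

print(single_link_agglomerative([1, 2, 4, 8, 9], threshold=1))   # [[1, 2], [4], [8, 9]]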
Single Link
View all items with links (distances) between them.
Finds maximal connected components in this graph.
Two clusters are merged if there is at least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed to several steps.
Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
Usually deals with static sets.
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters until the desired set is reached.
A high degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim)
K-Means Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
Randomly assign means: m1 = 3, m2 = 4
K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}, m1 = 2.5, m2 = 16
K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}, m1 = 3, m2 = 18
K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}, m1 = 4.75, m2 = 19.6
K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}, m1 = 7, m2 = 25
Stop, as the clusters with these means are the same.
K-Means Algorithm
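A minimal K-Means sketch that reproduces the worked example above: assign each point to the nearest mean, recompute the means, and stop when they no longer change (the sketch assumes every cluster stays non-empty, which holds for this data).

def k_means(points, means):
    while True:
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:          # clusters (and means) no longer change
            return clusters, means
        means = new_means

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means(points, means=[3, 4])
print([sorted(c) for c in clusters], means)   # [[2, 3, 4, 10, 11, 12], [20, 25, 30]] [7.0, 25.0]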
Nearest Neighbor
Items are iteratively merged into the existing clusters that are closest.
Incremental
A threshold, t, is used to determine if items are added to existing clusters or a new cluster is created.
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree
Clustering Feature
CF Triple: (N, LS, SS)
N: number of points in the cluster
LS: sum of the points in the cluster
SS: sum of squares of the points in the cluster
CF Tree
Balanced search tree
Node has a CF triple for each child
Leaf node represents a cluster and has a CF value for each subcluster in it.
Each subcluster has a maximum diameter.
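A small sketch of the CF triple for 1-D points; because two CFs combine by component-wise addition, clusters can be merged without revisiting the raw points, which is what makes BIRCH incremental. The points below are invented.

def clustering_feature(points):
    # CF triple (N, LS, SS) summarizing one cluster of 1-D points.
    n = len(points)
    ls = sum(points)
    ss = sum(p * p for p in points)
    return (n, ls, ss)

def merge(cf1, cf2):
    # Merging two clusters is just adding their CF triples.
    return tuple(a + b for a, b in zip(cf1, cf2))

print(clustering_feature([1, 2, 3]))               # (3, 6, 14)
print(merge((3, 6, 14), clustering_feature([4])))  # (4, 10, 30)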
CURE
Clustering Using Representatives
Use many points to represent a cluster instead of only one.
Points will be well scattered.
CURE Approach
CURE Algorithm
ASSOCIATION RULES
Comparing Techniques
Incremental Algorithm
Advanced AR Techniques
Uses:
Placement
Advertising
Sales
Coupons
Apriori
Large Itemset Property: any subset of a large itemset is large.
Contrapositive: if an itemset is not large, none of its supersets are large.
Apriori Example (contd.): s = 30%, α = 50%
Apriori Algorithm
1. C1 = itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5.   i = i + 1;
6.   Ci = Apriori-Gen(Li-1);
7.   Count Ci to determine Li;
8. until no more large itemsets are found;
Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i.
Approach used: join large itemsets of size i if they agree on the first i-1 items.
May also prune candidates which have subsets that are not large.
Apriori-Gen Example
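A minimal sketch of the join-and-prune step, run on the Bread / Jelly / PeanutButter itemsets used elsewhere in these slides; the helper name apriori_gen and the exact data layout are our own assumptions, not the original worked example.

from itertools import combinations

def apriori_gen(large_itemsets):
    # Join large itemsets of size i that agree on their first i-1 items,
    # then prune candidates with any i-subset that is not large.
    large = [tuple(sorted(s)) for s in large_itemsets]
    i = len(large[0])
    candidates = set()
    for a in large:
        for b in large:
            if a[:i - 1] == b[:i - 1] and a[i - 1] < b[i - 1]:
                c = a + (b[i - 1],)
                if all(s in large for s in combinations(c, i)):
                    candidates.add(c)
    return candidates

L2 = [("Bread", "Jelly"), ("Bread", "PeanutButter"), ("Jelly", "PeanutButter")]
print(apriori_gen(L2))   # {('Bread', 'Jelly', 'PeanutButter')}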
Apriori Adv/Disadv
Advantages:
Uses the large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes the transaction database is memory resident.
Requires up to m database scans.
Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): large itemsets from the sample.
Negative Border (BD-):
Generalization of Apriori-Gen applied to itemsets of varying sizes.
Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
(Figure: PL vs. PL ∪ BD-(PL))
Sampling Algorithm
1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls;
3. C = PL ∪ BD-(PL);
4. Count C in database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done;
7. else C = repeated application of BD-;
8. Count C in database;
Sampling Example
Find AR assuming s = 20%
Ds = {t1, t2}
smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
BD-(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD- generates all remaining itemsets.
Sampling Adv/Disadv
Advantages:
Reduces the number of database scans to one in the best case and two in the worst.
Scales better.
Disadvantages:
Potentially large number of candidates in the second pass.
Partitioning
Divide the database into partitions D1, D2, ..., Dp
Apply Apriori to each partition.
Any large itemset must be large in at least one partition.
Partitioning Algorithm
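A small sketch of the partitioning approach: locally large itemsets from each partition become the candidates, which are then counted in one pass over the whole database. The brute-force locally_large helper stands in for a real Apriori pass on each partition, and the tiny transactions are invented.

from itertools import combinations

def locally_large(transactions, min_support):
    # Brute-force stand-in for Apriori on one partition.
    items = sorted({i for t in transactions for i in t})
    large = set()
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(1 for t in transactions if set(cand) <= t) >= min_support * len(transactions):
                large.add(cand)
    return large

def partitioned_large(partitions, min_support):
    # Union of locally large itemsets, then one counting pass over the full database;
    # a globally large itemset must be large in at least one partition.
    candidates = set().union(*(locally_large(p, min_support) for p in partitions))
    db = [t for p in partitions for t in p]
    return {c for c in candidates
            if sum(1 for t in db if set(c) <= t) >= min_support * len(db)}

D1 = [{"Bread", "Jelly"}, {"Bread", "PeanutButter"}]
D2 = [{"Bread", "Milk"}, {"Beer", "Milk"}]
print(partitioned_large([D1, D2], min_support=0.5))   # {('Bread',), ('Milk',)} (set order may vary)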
Partitioning Example (s = 10%)
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}
Partitioning Adv/Disadv
Advantages:
Adapts to available main memory
Easily parallelized
Maximum number of database scans is two.
Disadvantages:
May have many candidates during the second scan.
Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
What is counted at each site
How data (transactions) are distributed
Data Parallelism:
Data partitioned
Count Distribution Algorithm
Task Parallelism:
Data and candidates partitioned
Data Distribution Algorithm
THANKS!