
DATA MINING

UNIT - II

Association Rule: Introduction - Methods in association rule - Apriori algorithm.
Clustering: Introduction - Clustering paradigms - Partition algorithm - K-medoid
algorithms - CLARA - CLARANS - Hierarchical clustering - DBSCAN - BIRCH - CURE.
ASSOCIATION RULE
Introduction
Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using some
measures of interestingness.
The problem was formulated by Agrawal et al. in 1993 and is often referred to as the
market-basket problem.
The problem is to analyze customers' buying habits by finding associations
between the different items that customers place in their shopping baskets.
Association rules are frequently used by retail stores to assist in marketing,
advertising, floor placement, and inventory control.
WHAT IS ASSOCIATION RULE
An association rule, A => B, is of the form: "for a set of transactions, some value of
itemset A determines the values of itemset B, under the condition that minimum support
and confidence are met".
METHODS IN ASSOCIATION RULE
Association rule mining finds interesting association or correlation relationships among a large set of
data items.
Support and confidence: these measures express the quality of a given rule in terms of its
usefulness (strength) and certainty. The support of a rule A => B is the fraction of transactions
that contain both A and B; its confidence is the fraction of transactions containing A that also contain B.

Problem decomposition:
The problem of mining association rules can be decomposed into two sub problems.
Find all the frequent itemsets.
Generate association rules from the above frequent itemsets.
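A minimal Python sketch of these two measures over a toy transaction list (the items, transactions, and the printed rule are illustrative assumptions, not values from the slides):

# Toy market-basket transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk", "jam"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    # Fraction of transactions containing A that also contain B.
    return support(set(A) | set(B), transactions) / support(A, transactions)

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75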
DEFINITIONS
Frequent set: an itemset whose support is at least the minimum support (the set of
frequent i-itemsets is denoted by Li).
Downward Closure Property: Any subset of a frequent set must be a frequent
set.
Upward Closure Property: Any superset of an infrequent set is an infrequent
set.
Maximal frequent set: A frequent set is a maximal frequent set if it is a
frequent set and no superset of this is a frequent set.
Border Set: An itemset is a border set if it is not a frequent set, but all its proper
subsets are frequent sets.
APRIORI ALGORITHM
A level-wise algorithm.
Proposed by Agrawal and Srikant in 1994.
Uses the downward closure property.
Bottom-up search.
CANDIDATE GENERATION
PRUNING
APRIORI ALGORITHM BY EXAMPLE
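Since the worked-example slides are images, here is a hedged, from-scratch sketch of the level-wise Apriori search (candidate generation followed by pruning via the downward closure property); the minimum-support value and the helper names are assumptions:

from itertools import combinations

def apriori(transactions, min_support=0.4):
    # Level-wise search: build L1, L2, ... using the downward closure property.
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: join L(k-1) with itself.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Keep only candidates meeting the minimum support.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

# Example: apriori([{"bread", "milk"}, {"bread", "butter"}, {"milk"}], min_support=0.5)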
CLUSTERING

Cluster analysis is the process of finding groups of similar objects in order to form clusters.
It is an unsupervised machine learning technique that acts on unlabelled data.
Data points that are similar to one another are grouped together into a cluster, so that all
objects in a cluster belong to the same group.
Clustering Paradigms
There are two main approaches to clustering
Hierarchical clustering
Partitioning clustering
THE PARTITION CLUSTERING
The Partition clustering techniques partition the database into a predefined number of
clusters. They attempt to determine k partitions that optimise a certain criterion function.

Two types:
-k-means algorithms
-k-medoid algorithms

The hierarchical clustering techniques build a sequence of partitions, in which each
partition is nested into the next partition in the sequence.

Two types:
-Agglomerative
-Divisive
AGGLOMERATIVE CLUSTERING
Agglomerative clustering is a bottom-up approach: initially each data point is a cluster
of its own, and pairs of clusters are merged as one moves up the hierarchy.

Steps of Agglomerative Clustering:


Initially, each data point is a cluster of its own.
Take the two nearest clusters and join them to form one single cluster.
Repeat step 2 until the desired number of clusters is obtained (a sketch of these
steps follows).
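A minimal from-scratch sketch of these steps, assuming 2-D points and single-linkage (closest-member) distance between clusters; both assumptions go beyond what the slide specifies:

import numpy as np

def agglomerative(points, k):
    # Bottom-up: start with one cluster per point, repeatedly merge the two
    # closest clusters (single linkage) until only k clusters remain.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge the nearest pair
        del clusters[b]
    return clusters

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [5, 1]])
print(agglomerative(X, 2))   # e.g. [[0, 1, 4], [2, 3]]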
DIVISIVE CLUSTERING
Divisive clustering is a top-down approach: initially, all the points in the dataset belong
to one cluster, and splits are performed recursively as one moves down the hierarchy.

Steps of Divisive Clustering:


Initially, all points in the dataset belong to one single cluster.
Partition the cluster into the two least similar clusters.
Proceed recursively to form new clusters until the desired number of clusters is
obtained (see the sketch after this list).
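A corresponding top-down sketch; the choice of which cluster to split (the one with the largest spread) and the splitting rule (a tiny 2-means pass) are assumptions made purely for illustration:

import numpy as np

def two_means_split(points, iters=10):
    # Split one cluster into two with a few iterations of 2-means
    # (no guard against an empty split; fine for a sketch).
    centers = points[np.random.choice(len(points), 2, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return [points[labels == 0], points[labels == 1]]

def divisive(points, k):
    # Top-down: everything starts in one cluster; keep splitting the cluster
    # with the largest variance until k clusters remain.
    clusters = [np.asarray(points, dtype=float)]
    while len(clusters) < k:
        widest = max(range(len(clusters)), key=lambda i: clusters[i].var(axis=0).sum())
        clusters.extend(two_means_split(clusters.pop(widest)))
    return clusters

print(divisive([[1, 1], [1, 2], [8, 8], [9, 8], [5, 1]], 2))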
NUMERIC VS CATEGORICAL

Clustering can be performed on both numerical data and categorical data.

Numerical data-
The geometric properties can be used to define the distances between the points.
Numerical data refers to the data that is in the form of numbers, and not in any language
or descriptive form.
It can be processed statistically and arithmetically.

Categorical data-
Consists of categorical attributes, on which distance functions are not naturally defined.
Categorical data refers to a data type that can be stored and identified based on the names
or labels given to them.
The data collected in the categorical form is also known as qualitative data.
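A small sketch contrasting the two cases: Euclidean distance for numeric records versus a simple-matching dissimilarity for categorical records (the simple-matching measure is one common choice, not something the slide prescribes):

import math

def euclidean(a, b):
    # Geometric distance between two numeric records.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simple_matching(a, b):
    # Fraction of categorical attributes on which two records disagree.
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(euclidean((1.0, 2.0), (4.0, 6.0)))                               # 5.0
print(simple_matching(("red", "S", "cotton"), ("red", "M", "wool")))   # ~0.67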
PARTITIONING ALGORITHMS

The Partitioning clustering algorithm adopts the iterative Optimisation Paradigm.

It starts with an initial partition and uses an iterative control strategy.

Two main categories of partitioning algorithms:

*k-means algorithms, where each cluster is represented by the centre of gravity of the cluster.

*k-medoid algorithms, where each cluster is represented by one of the objects of the cluster,
located near its centre.
There are three algorithms for K-medoids Clustering:
PAM (Partitioning around medoids)
CLARA (Clustering LARge Applications)
CLARANS ("Randomized" CLARA).

Among these, PAM is considered the most powerful and the most widely used.
However, PAM has a drawback due to its time complexity: it cannot handle large
volumes of data.
K-MEDOIDS ALGORITHMS
PAM (Partition Around Medoids)
PAM uses a k-Medoid method to identify the clusters.
The algorithm has two important modules:
The Partitioning of the database for a given set of medoids
The Iterative selection of medoids.

Partitioning:
Oj - a non-selected object
Oi - a medoid
Oj is assigned to Oi
if d(Oi, Oj) = min_Oe d(Oe, Oj), where the minimum is taken over all medoids Oe, and d(Oa, Ob)
denotes the distance, or dissimilarity, between objects Oa and Ob.
The dissimilarity matrix is known prior to the commencement of PAM.
K-MEDOIDS ALGORITHMS (CONTD)
Iterative selection of Medoids
The effect of swapping Oi and Oh is that an unselected object becomes
a medoid replacing an existing medoid.
The new set of k medoids is Kmed' = {O1, O2, ..., Oi-1, Oh, Oi+1, ..., Ok}, where Oh
replaces Oi as one of the medoids from Kmed = {O1, O2, ..., Ok}.
The quality of the two clusterings is then compared via the change in total cost.
If the cost change is a negative number, the swap is made permanent and the next
random selection of a medoid is made.
If the cost change is a positive number, the swap is undone; once no improving swap
remains, the optimized clusters have been formed.
Let's consider the following example (the data points and their scatter plot appear on the slide as images and are not reproduced here).
Step #1: k = 2
Let the randomly selected 2 medoids be C1 -(3, 4) and C2 -(7, 4).
Step #2: Calculating cost.
The dissimilarity of each non-medoid point with the medoids is calculated and tabulated:
Each point is assigned to the cluster of the medoid to which its dissimilarity is least.
The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
The cost C = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
C = 20
Step #3: Now randomly select one non-medoid point and recalculate the cost.
Let the randomly selected point be (7, 3). The dissimilarity of each non-medoid point
with the medoids – C1 (3, 4) and C2 (7, 3) is calculated and tabulated.
Each point is assigned to the cluster whose medoid has the least dissimilarity to it. So, the points 1, 2, 5 go to cluster C1
and 0, 3, 6, 7, 8 go to cluster C2.
The cost C = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3)
C = 22
Swap Cost = Present Cost – Previous Cost
= 22 – 20 = 2 >0
As the swap cost is not less than zero, we undo the swap. Hence (3, 4) and (7, 4) are the final medoids.
The final clustering is therefore formed around the medoids (3, 4) and (7, 4) (shown as a figure on the slide).
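The swap-cost comparison above can be expressed in a few lines of Python; the Manhattan distance and the sample points below are assumptions for illustration, not the values from the slide's table:

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    # Sum, over all non-medoid points, of the distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

# Hypothetical 2-D points, purely for illustration.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5)]
previous = total_cost(points, [(3, 4), (7, 4)])   # current medoids
present = total_cost(points, [(3, 4), (7, 3)])    # after the trial swap
swap_cost = present - previous
# If swap_cost < 0 the swap is kept; otherwise (as in the slide) it is undone.
print(previous, present, swap_cost)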
STEPS INVOLVED IN K-MEDOID ALGORITHM

STEP1: Initialize k clusters in the given data space D.


STEP2: Randomly choose k objects from the n objects in the data and assign each of
the k objects to a distinct cluster, so that each becomes the initial medoid of one
cluster.
STEP3: For all remaining non-medoid objects, compute the cost (distance, computed
via Euclidean, Manhattan, or Chebyshev methods) from all medoids.
STEP4: Now, assign each remaining non-medoid object to the cluster whose medoid
is nearest to it compared with the other clusters' medoids.
STEP5: Compute the total cost, i.e. the sum of the distances of all non-medoid
objects from their cluster medoids, and assign it to dj.
STEP6: Randomly select a non-medoid object i.
STEP7: Now, temporarily swap the object i with medoid j and repeat STEP5 to
recalculate the total cost; assign it to di.
STEP8: If di < dj then make the temporary swap in STEP7 permanent to
form the new set of k medoids; else undo the temporary swap done
in STEP7.
STEP9: Repeat STEP4 to STEP8 until no change (a runnable sketch of these
steps follows).
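A compact, runnable sketch of STEP1-STEP9; it assumes Manhattan distance, points stored as tuples, and a fixed number of random trial swaps instead of looping strictly until no change, so treat it as an illustration rather than the exact PAM procedure:

import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    # STEP5: sum of every object's distance to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, trials=200):
    medoids = random.sample(points, k)                 # STEP2: initial medoids
    dj = total_cost(points, medoids)                   # STEP3-STEP5
    for _ in range(trials):
        i = random.choice([p for p in points if p not in medoids])   # STEP6
        j = random.randrange(k)                        # medoid to swap out
        trial = medoids[:j] + [i] + medoids[j + 1:]    # STEP7: temporary swap
        di = total_cost(points, trial)
        if di < dj:                                    # STEP8: keep an improving swap
            medoids, dj = trial, di
    return medoids   # final medoids; clusters follow by nearest-medoid assignment (STEP4)

pts = [(1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (9, 8), (5, 5)]
print(k_medoids(pts, 2))   # e.g. [(2, 2), (8, 8)]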
PAM Algorithm
CLARA
CLARA (Kaufmann and Rousseeuw in 1990)
It draws a sample of the data set, applies PAM on this sample to determine the
optimal set of medoids from the sample.
Strength:
Deals with larger data sets than PAM.
Reduces Computational effort
Weakness:
Efficiency depends on the sample size.
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased.
CLARA draws a sample of the dataset and applies PAM on the
sample in order to find the medoids.
If the sample is representative of the entire dataset, then the medoids of the
sample should approximate the medoids of the entire dataset.
Medoids are chosen from the sample.
The algorithm cannot find the best solution if one of the best k medoids is not in
the selected sample.
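A hedged sketch of this idea: draw a few random samples, run any PAM routine on each sample, and keep the medoid set whose cost over the whole dataset is lowest. The sample-size heuristic and the pam parameter (e.g. the k_medoids sketch shown earlier) are assumptions:

import random

def clara(points, k, pam, n_samples=5, sample_size=None):
    # Run PAM on several samples; score each medoid set on the WHOLE dataset.
    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    sample_size = sample_size or min(len(points), 40 + 2 * k)   # assumed heuristic
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, sample_size)
        medoids = pam(sample, k)          # any PAM routine, e.g. k_medoids above
        cost = sum(min(manhattan(p, m) for m in medoids) for p in points)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best

# Usage (reusing the earlier k_medoids sketch): clara(pts, 2, pam=k_medoids)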
CLARANS (“RANDOMIZED” CLARA)

CLARANS (A Clustering Algorithm based on Randomized Search)


CLARANS draws a sample of neighbors dynamically.
The clustering process can be presented as searching a graph where every
node is a potential solution, that is, a set of k medoids.
If a local optimum is found, CLARANS starts with a new randomly selected
node in search of a new local optimum.
It is more efficient and scalable than both PAM and CLARA.
Focusing techniques and spatial access structures may further improve its
performance.
CLARANS has two parameters:
Maxneighbor: the maximum number of neighbour pairs examined for swapping.
Numlocal: the number of locally optimal medoid sets to be found.
Steps involved :
CLARANS starts with a randomly selected set of k-medoids.
It checks “maxneighbor“ number of pairs for swapping.
CLARANS stops after the "numlocal" number of locally optimal medoid
sets has been determined and returns the best clustering (sketched below).
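A hedged sketch of this randomized search, where a node is a set of k medoids and a neighbour differs from it in exactly one medoid; the Manhattan cost function and the tuple representation of points are carried over from the earlier sketches as assumptions:

import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=20):
    best, best_cost = None, float("inf")
    for _ in range(numlocal):                       # find numlocal local optima
        current = random.sample(points, k)          # random starting node
        current_cost = cost(points, current)
        examined = 0
        while examined < maxneighbor:               # check up to maxneighbor swaps
            j = random.randrange(k)                 # medoid to replace
            o = random.choice([p for p in points if p not in current])
            neighbour = current[:j] + [o] + current[j + 1:]
            neighbour_cost = cost(points, neighbour)
            if neighbour_cost < current_cost:       # move to the better neighbour
                current, current_cost, examined = neighbour, neighbour_cost, 0
            else:
                examined += 1                       # another failed neighbour
        if current_cost < best_cost:                # keep the best local optimum
            best, best_cost = current, current_cost
    return best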
Drawbacks:
It assumes that all objects fit into main memory, and the result is very
sensitive to the input order.
Due to the trimming of the search, controlled by 'maxneighbor', it may
not find a real local minimum.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a non-representative object, Orandom;
(5) compute the total cost, S, of swapping representative object Oj with Orandom;
(6) if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
(7) until no change;
DBSCAN
DBSCAN uses a density-based notion of clusters to discover clusters of arbitrary
shapes.
In density based clustering we partition points into dense regions separated by not-so-
dense regions.
Clusters are defined as density-connected sets (Eps, MinPts).
Clustering based on density (local cluster criterion), such as density-connected points
Density and connectivity are measured by local distribution of nearest neighbor

Major features:
Discover clusters of arbitrary shape
Handle noise
Need density parameters as termination condition
The DBSCAN algorithm basically requires 2 parameters:

Eps: specifies how close points should be to each other to be considered a part
of a cluster. It means that if the distance between two points is lower or equal to
this value (eps), these points are considered neighbors.

-Density at point p: number of points within a circle of radius Eps

MinPts: the minimum number of points to form a dense region. For example,
if we set the minPoints parameter as 5, then we need at least 5 points to form a
dense region.

Dense Region: A circle of radius Eps that contains at least MinPts points
Characterization of points
A point is a core point if it has more than a specified number of points
(MinPts) within Eps. These points belong in a dense region and are at the
interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point.
A noise point is any point that is not a core point or a border point.
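A small sketch that labels points as core, border, or noise under these definitions; Euclidean distance and the convention that a point's Eps-neighbourhood includes the point itself are assumptions:

import math

def classify_points(points, eps, min_pts):
    # Label each point 'core', 'border', or 'noise'.
    def neighbours(p):
        return [q for q in points if math.dist(p, q) <= eps]   # includes p itself
    core = {p for p in points if len(neighbours(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbours(p)):   # in the neighbourhood of a core point
            labels[p] = "border"
        else:
            labels[p] = "noise"
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (8, 8)]
print(classify_points(pts, eps=1.5, min_pts=4))   # four core points, one border, one noise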
DBSCAN: Core, Border, and Noise points
The example illustrates the expand-cluster phase of the algorithm.
Assume MinPts=6
Start with the unclassified object O1. We find that there are 6 objects in the
neighbourhood of O1. Put all these points in the candidate-objects set and associate
them with the cluster-id of O1 (cluster C1).
Select the next object from the candidate-objects, O2. The neighbourhood of O2 does not
contain an adequate number of points, so mark O2 as a noise object.
O4 is already marked as noise. Let O3 be the next object from the candidate-objects.
The neighbourhood of O3 contains 7 points: O1, O3, O5, O6, O9, O10, O11. Among these, O9
and O10 are noise objects, O1 and O3 are already classified, and the others are unclassified.
7 objects are now associated with C1. The unclassified objects are included in the candidate-
objects for the next iteration.
BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a
hierarchical, agglomerative clustering algorithm proposed by Zhang,
Ramakrishnan and Livny.
It is designed to cluster large datasets of n-dimensional vectors using a
limited amount of main memory.
BIRCH proposes a special data structure called the CF tree; clustering features are
maintained in a height-balanced tree similar to a B+ tree.
BIRCH requires one pass to construct CF tree.
The subsequent stages work on this tree rather than the actual database.
Last stage requires one more database pass.
CLUSTERING FEATURES AND THE CF TREE

A major characteristic of the BIRCH algorithm is the use of the clustering feature, which
is a triple that contains information about a cluster.
CF = (n, ls, ss)
If the cluster contains the objects O1 = (x11, x12, x13), O2 = (x21, x22, x23), ..., On = (xn1, xn2, xn3),
then ls is their component-wise sum O1 + O2 + ... + On and ss is the sum of the squared components of all the objects.
DEFINITION
A clustering feature (CF) is a triple (N, Ls, SS), where the number of the
points in the cluster is N, Ls is the sum of the points in the cluster, and SS is
the sum of the squares of the points in the cluster.

A CF tree is a balanced tree with a branching factor (the maximum number of
children a node may have) B. Each internal node contains a CF triple for each
of its children. Each leaf node also represents a cluster and contains a CF
entry for each sub-cluster in it. A sub-cluster in a leaf node must have a
diameter no greater than a given threshold value T.
ADDITIVE PROPERTIES OF CLUSTER FEATURES
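The additivity property states that the CF of the union of two disjoint clusters is the component-wise sum of their CFs, i.e. CF1 + CF2 = (n1 + n2, ls1 + ls2, ss1 + ss2). A minimal sketch, assuming points stored as tuples and the (n, ls, ss) layout defined above:

def cf(points):
    # Clustering feature of a set of d-dimensional points: (n, ls, ss).
    d = len(points[0])
    n = len(points)
    ls = tuple(sum(p[i] for p in points) for i in range(d))      # linear sum
    ss = sum(x * x for p in points for x in p)                   # sum of squares
    return (n, ls, ss)

def cf_add(cf1, cf2):
    # Additivity: CF1 + CF2 = (n1 + n2, ls1 + ls2, ss1 + ss2).
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)

a = [(1, 2), (2, 2)]
b = [(4, 0)]
assert cf_add(cf(a), cf(b)) == cf(a + b)   # merging clusters = adding their CFs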
Basic Algorithm:

Phase 1: Construction of a CF Tree


Creating the initial CF tree "loads" the database into memory.
Identifying the Appropriate Leaf
Modifying the Leaf Node
Absorbing O in Li
Introduce O in the Leaf Node
Splitting of the Leaf Node
Modifying the path to the Leaf
Merging Refinement
Phase 2: Condensation of CF Tree
Resize the data set by building a smaller CF tree
Remove more outliers
Condensing is optional
Phase 3: Hierarchical Agglomerative Clustering
Global or Semi-Global Clustering
Use existing clustering algorithm (e.g. KMEANS, HC) on CF entries
Phase 4: Cluster refining
Refining is optional
Fixes the problem with CF trees where same valued data points may be
assigned to different leaf entries.
Outliers are removed
CURE
CURE – Clustering Using Representatives.
It is a sampling based hierarchical clustering technique adopting an agglomerative scheme
that is able to discover clusters of arbitrary shapes.
It uses a fixed number of points as representatives of each cluster (rather than a single centroid).
A centroid-based approach uses one point to represent a cluster, which carries too little information
and is sensitive to cluster shapes.
A constant number c of well scattered points in a cluster are chosen, and then shrunk toward
the center of the cluster by a specified fraction alpha.
The distance between two sub-clusters is measured via the closest pair of representative points:
for each cluster C, d_closest(C) = distance(C, C_nearest), the distance from C to its nearest cluster.
It maintains a heap data structure to determine the closest pair of sub clusters at every stage.
The clusters with the closest pair of representative points are merged at each step; the algorithm
stops when only k clusters are left, where k can be specified.
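A hedged sketch of the two distinctive CURE ingredients described above: choosing c well-scattered representative points and shrinking them toward the cluster centre by a fraction alpha; the farthest-point selection heuristic and the parameter values are assumptions:

import math

def scattered_representatives(cluster, c):
    # Pick c well-scattered points: start from the point farthest from the
    # centroid, then repeatedly add the point farthest from those chosen so far.
    centroid = [sum(x) / len(cluster) for x in zip(*cluster)]
    reps = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(reps) < min(c, len(cluster)):
        reps.append(max((p for p in cluster if p not in reps),
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps

def shrink(reps, cluster, alpha=0.2):
    # Move each representative a fraction alpha toward the cluster centroid.
    centroid = [sum(x) / len(cluster) for x in zip(*cluster)]
    return [tuple(r_i + alpha * (c_i - r_i) for r_i, c_i in zip(r, centroid))
            for r in reps]

def cluster_distance(reps_a, reps_b):
    # CURE merges the pair of clusters whose representative points are closest.
    return min(math.dist(a, b) for a in reps_a for b in reps_b)

cluster = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
reps = shrink(scattered_representatives(cluster, 3), cluster)
print(reps, cluster_distance(reps, [(5.0, 5.0)]))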
