UNIT - II
Problem decomposition:
The problem of mining association rules can be decomposed into two subproblems:
Find all the frequent itemsets.
Generate association rules from these frequent itemsets (a sketch of this step is shown below).
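As an illustration of the second subproblem, the following Python sketch generates rules from a single frequent itemset by splitting it into an antecedent and a consequent and keeping those whose confidence meets a threshold. The item names, support values, and the min_conf threshold are made up for the example.

from itertools import combinations

# Hypothetical support values, for illustration only
support = {
    frozenset({'milk'}): 0.6,
    frozenset({'bread'}): 0.7,
    frozenset({'milk', 'bread'}): 0.5,
}

def rules_from_itemset(itemset, support, min_conf):
    """Generate rules A -> (itemset - A) whose confidence >= min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            antecedent = frozenset(antecedent)
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

print(rules_from_itemset({'milk', 'bread'}, support, min_conf=0.7))
# Both {milk}->{bread} (conf ~0.83) and {bread}->{milk} (conf ~0.71) meet min_conf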
DEFINITIONS
Frequent set: An itemset whose support is at least the minimum support threshold (the set of frequent i-itemsets is denoted by Li).
Downward Closure Property: Any subset of a frequent set must be a frequent set.
Upward Closure Property: Any superset of an infrequent set is an infrequent set.
Maximal frequent set: A frequent set is a maximal frequent set if no superset of it is a frequent set.
Border Set: An itemset is a border set if it is not a frequent set, but all its proper subsets are frequent sets.
APRIORI ALGORITHM
Level-wise algorithm.
Proposed by Agrawal and Srikant in 1994.
Uses the downward closure property.
Bottom-up search.
CANDIDATE GENERATION
PRUNING
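The join (candidate generation) and prune steps can be sketched in Python as follows. This is an illustrative rendering of the standard Apriori steps, not the exact pseudocode of the lecture; the function names and the sample frequent 2-itemsets are made up.

from itertools import combinations

def generate_candidates(frequent_k_minus_1, k):
    """Join step: combine frequent (k-1)-itemsets that share k-2 items."""
    candidates = set()
    items = list(frequent_k_minus_1)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            union = items[i] | items[j]
            if len(union) == k:
                candidates.add(union)
    return candidates

def prune(candidates, frequent_k_minus_1):
    """Prune step: drop any candidate that has an infrequent (k-1)-subset
    (downward closure property)."""
    pruned = set()
    for c in candidates:
        if all(frozenset(s) in frequent_k_minus_1
               for s in combinations(c, len(c) - 1)):
            pruned.add(c)
    return pruned

# Example: frequent 2-itemsets
L2 = {frozenset({'A', 'B'}), frozenset({'A', 'C'}),
      frozenset({'B', 'C'}), frozenset({'A', 'D'})}
C3 = prune(generate_candidates(L2, 3), L2)
print(C3)   # {frozenset({'A', 'B', 'C'})}; candidates containing B,D or C,D are pruned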
APRIORI ALGORITHM BY EXAMPLE
CLUSTERING
Partitioning algorithms, two types:
-k-means algorithms
-k-medoid algorithms
Hierarchical algorithms, two types:
-Agglomerative
-Divisive
AGGLOMERATIVE CLUSTERING
Agglomerative clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are then merged as one moves up the hierarchy.
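As a usage illustration, the following sketch runs agglomerative clustering with SciPy's hierarchy module, assuming SciPy is available; the data points and the choice of single linkage are made up for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D numerical data points
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Agglomerative (bottom-up) merging: each point starts as its own cluster,
# and the closest pair of clusters is merged at each step.
Z = linkage(X, method='single', metric='euclidean')

# Cut the hierarchy to obtain, for example, 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)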
Numerical data -
The geometric properties of the data can be used to define the distances between the points.
Numerical data refers to data that is in the form of numbers, and not in any language or descriptive form.
It can be manipulated statistically and arithmetically.
Categorical data -
Consists of categorical attributes, on which distance functions are not naturally defined.
Categorical data refers to a data type that can be stored and identified based on the names or labels given to the values.
The data collected in categorical form is also known as qualitative data.
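A small sketch contrasting the two cases: a geometric (Euclidean) distance for numerical records, and a simple-matching dissimilarity (fraction of attributes that differ) as one common choice for categorical records; the sample records are made up.

import math

def euclidean(a, b):
    """Geometric distance for numerical records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simple_matching_dissimilarity(a, b):
    """Fraction of categorical attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(euclidean((3, 4), (7, 4)))                                    # 4.0
print(simple_matching_dissimilarity(('red', 'small', 'round'),
                                    ('red', 'large', 'round')))     # 0.333...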
PARTITIONING ALGORITHMS
Among the partitioning algorithms, PAM is known to be the most powerful and is widely used.
However, PAM has a drawback due to its time complexity: it cannot handle large volumes of data.
K-MEDOIDS ALGORITHMS
PAM (Partition Around Medoids)
PAM uses a k-Medoid method to identify the clusters.
The algorithm has two important modules:
The Partitioning of the database for a given set of medoids
The Iterative selection of medoids.
Partitioning:
Oj – a non-selected object
Oi – a medoid
Oj belongs to the cluster of Oi if d(Oi, Oj) = min{ d(Oe, Oj) }, where the minimum is taken over all medoids Oe, and d(Oa, Ob) denotes the distance, or dissimilarity, between objects Oa and Ob.
The dissimilarity matrix is known prior to the commencement of PAM.
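A minimal sketch of the partitioning module, assuming Manhattan distance between 2-D points as the dissimilarity d(.,.); the function names are illustrative.

def dissimilarity(a, b):
    """Manhattan distance, used here as the dissimilarity d(., .)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def partition(objects, medoids):
    """Assign each non-selected object Oj to the medoid Oi with minimum d(Oi, Oj)."""
    clusters = {m: [] for m in medoids}
    for o in objects:
        nearest = min(medoids, key=lambda m: dissimilarity(m, o))
        clusters[nearest].append(o)
    return clusters

# Example usage (points = the dataset being clustered):
# clusters = partition(points, [(3, 4), (7, 4)])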
K-MEDOIDS ALGORITHMS (CONTD)
Iterative selection of Medoids
The effect of swapping Oi and Oh is that an unselected object becomes a medoid, replacing an existing medoid.
The new set of k medoids is Kmed' = {O1, O2, ..., Oi-1, Oh, Oi+1, ..., Ok}, where Oh replaces Oi as one of the medoids from Kmed = {O1, O2, ..., Ok}.
The quality of the resulting clustering is then compared (the cost of the swap is computed).
If the cost value is negative, the swap is made permanent and the next random selection of a medoid is made.
If the cost value is positive, the swap is undone; once no improving swap remains, the optimized clusters are formed.
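A minimal sketch of the swap evaluation, reusing the dissimilarity function from the previous sketch; here the cost is the total dissimilarity of all objects to their nearest medoid.

def total_cost(objects, medoids):
    """Sum of dissimilarities of every object to its nearest medoid."""
    return sum(min(dissimilarity(m, o) for m in medoids) for o in objects)

def try_swap(objects, medoids, o_i, o_h):
    """Replace medoid o_i with non-medoid o_h only if the swap lowers the cost."""
    current = total_cost(objects, medoids)
    new_medoids = [o_h if m == o_i else m for m in medoids]
    swap_cost = total_cost(objects, new_medoids) - current
    if swap_cost < 0:            # negative swap cost: make the swap permanent
        return new_medoids, swap_cost
    return medoids, swap_cost    # non-negative swap cost: undo the swap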
Let’s consider the following example:
If a graph is drawn using the above data points, we obtain the following:
Step #1: k = 2
Let the randomly selected 2 medoids be C1 -(3, 4) and C2 -(7, 4).
Step #2: Calculating cost.
The dissimilarity of each non-medoid point with the medoids is calculated and tabulated:
Each point is assigned to the cluster of the medoid to which its dissimilarity is the least.
The points 1, 2, 5 go to cluster C1 and the points 0, 3, 6, 7, 8 go to cluster C2.
The cost C = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
C = 20
Step #3: Now randomly select one non-medoid point and recalculate the cost.
Let the randomly selected point be (7, 3). The dissimilarity of each non-medoid point
with the medoids – C1 (3, 4) and C2 (7, 3) is calculated and tabulated.
Each point is assigned to the cluster of the medoid to which its dissimilarity is the least. So, the points 1, 2, 5 go to cluster C1
and 0, 3, 6, 7, 8 go to cluster C2.
The cost C = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3)
C = 22
Swap Cost = Present Cost – Previous Cost
= 22 – 20 = 2 >0
As the swap cost is not less than zero, we undo the swap. Hence (3, 4) and (7, 4) are the final medoids.
The clustering would be in the following way
STEPS INVOLVED IN K-MEDOID ALGORITHM
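The steps can be tied together in a compact, simplified PAM loop that builds on the partition, total_cost and try_swap sketches above; the random initialization and the stopping rule (stop when no swap improves the cost) are simplifying assumptions.

import random

def pam(objects, k, seed=0):
    """Simplified PAM: pick k medoids, then repeatedly try swapping a medoid
    with a non-medoid and keep swaps whose cost change is negative."""
    random.seed(seed)
    medoids = random.sample(objects, k)
    improved = True
    while improved:
        improved = False
        for o_h in objects:
            if o_h in medoids:
                continue
            for o_i in list(medoids):
                medoids, swap_cost = try_swap(objects, medoids, o_i, o_h)
                if swap_cost < 0:
                    improved = True
    return medoids, partition(objects, medoids)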
DBSCAN ALGORITHM
Eps: specifies how close points should be to each other to be considered part of a cluster. If the distance between two points is less than or equal to this value (Eps), the points are considered neighbours.
MinPts: the minimum number of points required to form a dense region. For example, if we set the MinPts parameter to 5, then we need at least 5 points to form a dense region.
Dense Region: A circle of radius Eps that contains at least MinPts points
Characterization of points
A point is a core point if it has more than a specified number of points (MinPts) within Eps. These points belong to a dense region and are in the interior of a cluster.
A border point has fewer than MinPts points within Eps, but is in the neighbourhood of a core point.
A noise point is any point that is not a core point or a border point.
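A usage sketch with scikit-learn's DBSCAN, assuming scikit-learn is installed; the data points and the eps/min_samples values are made up. Core points are reported via core_sample_indices_, noise points get the label -1, and the remaining clustered points are the border points.

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D points: two dense groups plus one stray point
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5],
              [8, 8], [8, 9], [9, 8], [9, 9], [8.5, 8.5],
              [5, 15]])

db = DBSCAN(eps=1.5, min_samples=4).fit(X)

core = set(db.core_sample_indices_)                        # core points
noise = {i for i, lbl in enumerate(db.labels_) if lbl == -1}   # noise points
border = set(range(len(X))) - core - noise                 # clustered but not core
print(core, border, noise)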
DBSCAN: Core, Border, and Noise points
The following example illustrates the expand-cluster phase of the algorithm.
Assume MinPts = 6.
Start with the unclassified object O1. We find that there are 6 objects in the neighbourhood of O1. Put all these points in the candidate-objects list and associate them with the cluster-id of O1 (cluster C1).
Select the next object from candidate-objects, O2. The neighbourhood of O2 does not contain an adequate number of points, so mark O2 as a noise object.
O4 is already marked as noise. Let O3 be the next object from candidate-objects. The neighbourhood of O3 contains 7 points: O1, O3, O5, O6, O9, O10, O11. Among these, O9 and O10 are noise objects, O1 and O3 are already classified, and the others are unclassified.
7 objects are associated with C1. The unclassified objects are included in candidate-objects for the next iteration.
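A simplified sketch of the expand-cluster idea described above, using an Eps-neighbourhood query and a candidate-objects list; the function names and the label bookkeeping are illustrative, not an exact transcription of the text.

def region_query(points, p, eps, dissimilarity):
    """Return the indices of all points within Eps of point index p."""
    return [q for q in range(len(points))
            if dissimilarity(points[p], points[q]) <= eps]

def expand_cluster(points, p, cluster_id, labels, eps, min_pts, dissimilarity):
    """Grow cluster cluster_id from object p using a candidate-objects list."""
    candidates = region_query(points, p, eps, dissimilarity)
    if len(candidates) < min_pts:
        labels[p] = 'noise'              # not enough neighbours: noise object
        return False
    for q in candidates:                 # associate neighbours with cluster_id
        labels[q] = cluster_id
    candidates.remove(p)
    while candidates:
        q = candidates.pop()
        neighbours = region_query(points, q, eps, dissimilarity)
        if len(neighbours) >= min_pts:   # q is itself a core object
            for r in neighbours:
                if labels.get(r) in (None, 'noise'):
                    if labels.get(r) is None:
                        candidates.append(r)   # unclassified: examine later
                    labels[r] = cluster_id
    return True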
BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is a hierarchical-agglomerative clustering algorithm proposed by Zhang, Ramakrishnan and Livny.
It is designed to cluster large datasets of n-dimensional vectors using a limited amount of main memory.
BIRCH proposes a special data structure called the CF tree; clustering features are maintained in this tree, which is organized like a B+ tree.
BIRCH requires only one pass over the data to construct the CF tree.
The subsequent stages work on this tree rather than on the actual database.
The last stage requires one more pass over the database.
CLUSTERING FEATURES AND THE CF TREE
A major characteristic of the BIRCH algorithm is the use of the clustering feature, which
is a triple that contains information about a cluster.
CF = (n, LS, SS)
If the points in the cluster are O1 = (x11, x12, x13), O2 = (x21, x22, x23), ..., On = (xn1, xn2, xn3), then n is the number of points, LS is their component-wise sum, and SS is the sum of the squares of the points.
DEFINITION
A clustering feature (CF) is a triple (n, LS, SS), where n is the number of points in the cluster, LS is the sum of the points in the cluster, and SS is the sum of the squares of the points in the cluster.
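A minimal sketch of the clustering feature for 3-dimensional points, together with the additivity property that lets BIRCH merge two clusters by adding their CFs component-wise; the sample points are made up, and SS is kept as a single scalar (one common convention).

def clustering_feature(points):
    """CF = (n, LS, SS): count, component-wise sum, and sum of squares."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(dim)]
    ss = sum(x * x for p in points for x in p)   # scalar sum of squared components
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CFs are additive: the CF of the union of two disjoint clusters is the
    component-wise sum of their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

c1 = clustering_feature([(1, 2, 3), (2, 2, 2)])   # made-up points
c2 = clustering_feature([(0, 1, 1)])
print(c1, c2, merge_cf(c1, c2))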