Clustering
More Examples of Clustering
Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth observation
database
Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults
Recapitulation: Clustering
• The goal of clustering is to
– group data points that are close (or similar) to each other
– identify such groupings (or clusters) in an unsupervised manner
• Unsupervised: no information is provided to the algorithm on which
data points belong to which clusters
• Example
What should the clusters
be for these data points?
(Figure: a scatter of unlabeled data points marked with ×.)
What is Clustering?
• Clustering can be considered the most important unsupervised
learning problem; like every other problem of this kind, it deals with
finding a structure in a collection of unlabeled data.
In this case we easily identify the 4 clusters into which the data can be
divided; the similarity criterion is distance: two or more objects belong to
the same cluster if they are “close” according to a given distance. This is
called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects
belong to the same cluster if together they define a concept common to all of
those objects.
In other words, objects are grouped according to their fit to descriptive
concepts, not according to simple similarity measures.
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• Radius: square root of the average squared distance from any point of the cluster to its
centroid

  Rm = √( Σ_{i=1..N} (tmi − Cm)² / N )

• Diameter: square root of the average mean squared distance between all pairs
of points in the cluster

  Dm = √( Σ_{i=1..N} Σ_{j=1..N} (tmi − tmj)² / (N · (N − 1)) )

where tmi is the i-th point of cluster Km, Cm is its centroid and N is the number of points in the cluster.
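A minimal NumPy sketch of these two measures (the function names and the sample points are illustrative, not from the slides):

import numpy as np

def cluster_radius(points, centroid):
    # square root of the average squared distance from each point to the centroid (Rm)
    points = np.asarray(points, dtype=float)
    return np.sqrt(np.mean(np.sum((points - centroid) ** 2, axis=1)))

def cluster_diameter(points):
    # square root of the average squared distance over all ordered pairs of distinct points (Dm)
    points = np.asarray(points, dtype=float)
    n = len(points)
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)   # N x N squared distances
    return np.sqrt(sq.sum() / (n * (n - 1)))                               # diagonal terms are zero

pts = [[1, 1], [2, 1], [1, 2], [2, 2]]                 # a small 2-D cluster
c = np.mean(np.asarray(pts, dtype=float), axis=0)      # its centroid
print(cluster_radius(pts, c), cluster_diameter(pts))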
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a
set of k clusters, such that the sum of squared distances to the cluster centroids is minimized:

  E = Σ_{m=1..k} Σ_{tmi ∈ Km} (Cm − tmi)²

where Km is the m-th cluster and Cm its centroid (mean).
(Figure: the K-Means method on a 2-D data set with K = 2. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign the objects; repeat until the assignments no longer change.)
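A minimal sketch of this loop in Python/NumPy (a sketch, not the slides' implementation; function and variable names are illustrative):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # arbitrarily choose K objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each object to the most similar (nearest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # update the cluster means; keep the old center if a cluster went empty
        new_centers = np.array([X[labels == m].mean(axis=0) if np.any(labels == m) else centers[m]
                                for m in range(k)])
        if np.allclose(new_centers, centers):      # stop when the means no longer move
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()       # the objective E from the previous slide
    return centers, labels, sse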
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
Example
• Height and weight measurements are given for 12 objects. Using these two
variables, we need to group the objects.
Data Sample
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Step 1: Input
Dataset, clustering variables and maximum number of clusters (K in K-Means clustering)
In this dataset, only two variables – height and weight – are considered for
clustering
Step 2: Initialize cluster centroids
In this example, the value of K is taken as 2. The cluster centroids are
initialized with the first two observations.
Initial Centroid
Cluster Height Weight
K1 185 72
K2 170 56
Step 3: Calculate Euclidean Distance
The first two observations are assigned trivially: they were chosen as the initial
centroids, so their cluster membership is already known and the centroids do not
change.
Step 4: Move on to the next observation and calculate its Euclidean distance
Height Weight
168 60
Euclidean distance from Cluster 1: √((168 − 185)² + (60 − 72)²) = 20.81
Euclidean distance from Cluster 2: √((168 − 170)² + (60 − 56)²) = 4.47
Assignment: Cluster 2 (the smaller distance)
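The same check in NumPy (a re-computation of Step 4; the variable names are illustrative):

import numpy as np

x = np.array([168, 60])        # the observation from Step 4 (height, weight)
k1 = np.array([185, 72])       # initial centroid of cluster K1
k2 = np.array([170, 56])       # initial centroid of cluster K2

d1 = np.linalg.norm(x - k1)    # sqrt((168-185)^2 + (60-72)^2) ≈ 20.81
d2 = np.linalg.norm(x - k2)    # sqrt((168-170)^2 + (60-56)^2) ≈ 4.47
print(d1, d2)                  # the smaller distance wins, so the observation joins cluster 2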
The K-Medoids Method (PAM)
(Figure: K-Medoids with K = 2. Arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid; then, in a loop, compute the total cost of swapping a medoid O with a randomly selected non-medoid O_random and perform the swap if the quality is improved, until no change.)
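A minimal PAM-style sketch of this swapping loop in Python/NumPy (a sketch under the assumptions in the figure; names such as pam and total_cost are illustrative):

import numpy as np

def pam(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances

    def total_cost(meds):
        # cost = sum of distances from every object to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    medoids = rng.choice(n, size=k, replace=False)               # arbitrarily choose k initial medoids
    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                                       # try swapping each medoid O ...
            for o in range(n):                                   # ... with every non-medoid O_random
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                new_cost = total_cost(candidate)                 # compute total cost of swapping
                if new_cost < cost:                              # swap only if the quality improves
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:                                         # until no change
            break
    labels = D[:, medoids].argmin(axis=1)                        # assign each object to its nearest medoid
    return medoids, labels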
Hierarchical Clustering
• Clusters are created in levels, producing a set of clusters at each level.
• Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
• Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not require
the number of clusters k as an input, but it needs a termination condition
Illustrative Example:
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
(Figure: agglomerative clustering (Step 0 → Step 4) first merges a and b into {a, b} and d and e into {d, e}, then forms {c, d, e}, and finally {a, b, c, d, e}; divisive clustering performs the same sequence in reverse (Step 4 → Step 0). A cluster distance measure drives the merges and a termination condition stops the process.)
Hierarchical Agglomerative Clustering
(HAC)
• Starts with each item (document) in a separate cluster
– then repeatedly joins the closest pair of clusters, until there is only one
cluster.
• The history of merging forms a binary tree or hierarchy.
How do we measure the distance between clusters?
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “farthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
Cluster Distance Measures
• Single link (min): smallest distance between an element in one cluster and an
element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}; note that d(C, C) = 0.
Dendrogram
• Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
• Each level shows clusters for
that level.
– Leaf – individual clusters
– Root – one cluster
• A cluster at level i is the union of
its children clusters at level i+1.
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature,
assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

Feature values:
         a  b  c  d  e
Feature  1  2  4  5  6

Distance matrix:
    a  b  c  d  e
a   0  1  3  4  5
b   1  0  2  3  4
c   3  2  0  1  2
d   4  3  1  0  1
e   5  4  2  1  0

Single link:
dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5

Average link:
dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
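A quick plain-Python check of the three measures on this example (the dictionary d below simply encodes the distance matrix above):

# d encodes the distance matrix above as a dict of dicts
d = {
    'a': {'a': 0, 'b': 1, 'c': 3, 'd': 4, 'e': 5},
    'b': {'a': 1, 'b': 0, 'c': 2, 'd': 3, 'e': 4},
    'c': {'a': 3, 'b': 2, 'c': 0, 'd': 1, 'e': 2},
    'd': {'a': 4, 'b': 3, 'c': 1, 'd': 0, 'e': 1},
    'e': {'a': 5, 'b': 4, 'c': 2, 'd': 1, 'e': 0},
}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']
pair_dists = [d[p][q] for p in C1 for q in C2]   # d(a,c), d(a,d), ..., d(b,e)

print(min(pair_dists))                    # single link   -> 2
print(max(pair_dists))                    # complete link -> 5
print(sum(pair_dists) / len(pair_dists))  # average link  -> 3.5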
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1) Convert all object features into a
distance matrix
2) Set each object as a cluster (thus if
we have N objects, we will have N
clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters):
– Merge the two closest clusters
– Update the “distance matrix”
(a minimal sketch of these steps is given below)
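A minimal sketch of this procedure with the single-link distance, in plain Python/NumPy (function and variable names are illustrative; the loop is the naive version that rescans all cluster pairs each round):

import numpy as np

def single_link_agglomerative(D, target_k=1):
    # D is a precomputed N x N distance matrix (step 1);
    # each object starts as its own cluster (step 2).
    D = np.asarray(D, dtype=float)
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > target_k:                 # step 3: repeat until the target number of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: smallest pairwise distance between the two clusters
                dist = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        clusters[a] = clusters[a] + clusters[b]     # merge the two closest clusters
        del clusters[b]                             # (the cluster distances are re-derived on the fly)
    return clusters, merges

# Example run on the distance matrix used earlier for {a, b, c, d, e}:
D = [[0, 1, 3, 4, 5],
     [1, 0, 2, 3, 4],
     [3, 2, 0, 1, 2],
     [4, 3, 1, 0, 1],
     [5, 4, 2, 1, 0]]
print(single_link_agglomerative(D, target_k=2))     # -> clusters [[0, 1], [2, 3, 4]], i.e. {a, b} and {c, d, e}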
Example
• Problem: clustering analysis with agglomerative algorithm
(Figure: the data matrix is converted into a distance matrix using the Euclidean distance.)
Example
• Merge two closest clusters (iteration 1)
Example
• Update distance matrix (iteration 1)
Example
• Merge two closest clusters (iteration 2)
Example
• Update distance matrix (iteration 2)
Example
• Merge two closest clusters/update distance matrix (iteration 3)
Example
• Merge two closest clusters/update distance matrix (iteration 4)
Example
• Final result (meeting termination condition)
Example
• Dendrogram tree representation
Exercise
Given a data set of five objects characterised by a single continuous feature:
         a  b  c  d  e
Feature  1  2  4  5  6
Apply the agglomerative algorithm with the single-link, complete-link and average-link cluster
distance measures to produce three dendrogram trees, respectively (a sketch for checking the
result follows the distance matrix below).
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
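One way to check the exercise is with SciPy's hierarchical-clustering routines (SciPy is not mentioned in the slides; this is just a suggested tool). Each linkage method corresponds to one of the three cluster distance measures:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# the distance matrix of the exercise (objects a, b, c, d, e)
D = np.array([[0, 1, 3, 4, 5],
              [1, 0, 2, 3, 4],
              [3, 2, 0, 1, 2],
              [4, 3, 1, 0, 1],
              [5, 4, 2, 1, 0]], dtype=float)

condensed = squareform(D)                 # SciPy expects the condensed (upper-triangle) form
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method)
    print(Z)                              # each row: cluster i, cluster j, merge distance, new cluster size
    # dendrogram(Z, labels=list("abcde"))  # uncomment (with matplotlib) to draw the tree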
Density-Based Clustering Algorithms
Density-Based Clustering
• Clustering based on density (local cluster criterion), such as density-
connected points or based on an explicitly constructed density
function
• A cluster is a connected dense component that can grow in any direction
that density leads.
• Density, connectivity and boundary
• Arbitrary shaped clusters and good scalability
• Each cluster has a considerably higher density of points than the region
outside of the cluster
Major Features
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
Two Major Types of Density-Based
Clustering Algorithms
• Connectivity based:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– CLIQUE: Agrawal, et al. (SIGMOD’98)
(Figure: the ε-neighborhoods of two points p and q. The density of p is “high” (MinPts = 4), while the density of q is “low” (MinPts = 4).)
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups: a core point
has at least MinPts points in its ε-neighborhood, a border point falls within the
ε-neighborhood of a core point but is not itself a core point, and an outlier is neither.
(Figure: MinPts = 3, Eps = the radius of the circles.)
Density-Reachability
• Directly density-reachable
– An object q is directly density-reachable from an object p if p is a
core object and q is in p's ε-neighborhood.
• Density-connected
– A point p is density-connected to a point q wrt. Eps, MinPts if
there is a point o such that both p and q are density-reachable
from o wrt. Eps and MinPts.
Formal Description of Cluster
• Given a data set D, a parameter ε and a threshold MinPts.
• A cluster C is a subset of objects satisfying two criteria:
– Connected: ∀ p, q ∈ C: p and q are density-connected.
– Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C.
(avoids redundancy)
Review of Concepts
Is an object o in a cluster or an outlier? Are objects p and q in the same cluster?
DBScan Algorithm
DBSCAN: The Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p w.r.t. ε and MinPts
– If p is a core point, a cluster is formed; if p is a border point, no points are
density-reachable from p and DBSCAN visits the next point of the database
– Continue the process until all of the points have been processed.
DBSCAN Algorithm: Example
• Parameters
– ε = 2 cm
– MinPts = 3

for each o ∈ D do
  if o is not yet classified then
    if o is a core-object then
      collect all objects density-reachable from o
      and assign them to a new cluster
    else
      assign o to NOISE
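The same idea is available off the shelf in scikit-learn (not mentioned in the slides; the data below is made up for illustration), where eps plays the role of ε and min_samples the role of MinPts:

import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups of points plus one isolated point (made-up data)
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3], [8.1, 7.8],
              [4.0, 15.0]])

labels = DBSCAN(eps=0.7, min_samples=3).fit_predict(X)
print(labels)   # cluster ids 0, 1, ...; noise points are labelled -1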
DBSCAN Algorithm: Advantages
• DBSCAN does not require one to specify the number of clusters in the data
a priori, as opposed to k-means.
• DBSCAN can find arbitrarily shaped clusters. It can even find a cluster
completely surrounded by (but not connected to) a different cluster. Due to
the MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
• DBSCAN has a notion of noise, and is robust to outliers.
• DBSCAN requires just two parameters and is mostly insensitive to the
ordering of the points in the database. (However, points sitting on the edge of
two different clusters might swap cluster membership if the ordering of the
points is changed, and the cluster assignment is unique only up to
isomorphism.)
• The parameters minPts and ε can be set by a domain expert, if the data is well
understood.
DBSCAN Algorithm: Disadvantages
• DBSCAN is not entirely deterministic: border points that are reachable from
more than one cluster can be part of either cluster, depending on the order the
data is processed. Fortunately, this situation does not arise often, and has little
impact on the clustering result: both on core points and noise points,
DBSCAN is deterministic.
• The quality of DBSCAN depends on the distance measure used in the
function regionQuery (P, ε). The most common distance metric used
is Euclidean distance. Especially for high-dimensional data, this metric can be
rendered almost useless due to the so-called "Curse of dimensionality",
making it difficult to find an appropriate value for ε. This effect, however, is
also present in any other algorithm based on Euclidean distance.
• DBSCAN cannot cluster data sets well with large differences in densities,
since the minPts-ε combination cannot then be chosen appropriately for all
clusters.
• If the data and scale are not well understood, choosing a meaningful distance
threshold ε can be difficult.
Steps of Grid-based Clustering
Algorithms
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density
of each cell.
3. Eliminate cells whose density is below a certain threshold.
4. Form clusters from contiguous (adjacent) groups of dense cells
(usually minimizing a given objective function). A minimal sketch of these four steps is given below.
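A minimal Python sketch of the four steps above for 2-D points (the function name, the cell size and the 4-neighbourhood adjacency rule are illustrative choices, not prescribed by the slides):

import numpy as np
from collections import deque

def grid_cluster(X, cell_size, density_threshold):
    X = np.asarray(X, dtype=float)
    # steps 1-2: define the grid cells and assign each object to its cell
    cells = {}
    for idx, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(idx)
    # step 3: keep only the cells whose density (point count) reaches the threshold
    dense = {k: v for k, v in cells.items() if len(v) >= density_threshold}
    # step 4: join contiguous dense cells (4-neighbourhood in 2-D) by flood fill
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        members, queue = [], deque([start])
        seen.add(start)
        while queue:
            c = queue.popleft()
            members.extend(dense[c])
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (c[0] + dx, c[1] + dy)
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(members)
    return clusters          # lists of point indices, one list per cluster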
Advantages of Grid-based Clustering Algorithms
• fast:
– No distance computations
– Clustering is performed on summaries and not individual objects;
complexity is usually O(#-populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
• Limitation: cluster shapes are restricted to unions of grid cells
Grid-Based Clustering Methods
• Grid-based methods quantize the object space into a finite number of cells
that form a grid structure (using a multi-resolution grid data structure).
• All the clustering operations are performed on the grid structure.
• Clustering complexity depends on the number of populated grid cells and
not on the number of objects in the dataset
• Several interesting methods (in addition to the basic grid-based algorithm)
– STING (a STatistical INformation Grid approach) by Wang, Yang and
Muntz (1997)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid
Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of
resolution
STING: A Statistical Information Grid
Approach (2)
– Each cell at a high level is partitioned into a number of smaller cells in the
next lower level
– Statistical info of each cell is calculated and stored beforehand and is
used to answer queries
– Parameters of higher level cells can be easily calculated from the parameters
of the lower level cells
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
STING: Query Processing(3)
Uses a top-down approach to answer spatial data queries:
1.Start from a pre-selected layer—typically with a small number of cells
2.From the pre-selected layer until you reach the bottom layer do the following:
• For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
– If it is relevant, include the cell in a cluster
– If it is irrelevant, remove the cell from further consideration
– otherwise, look for relevant cells at the next lower layer
3.Combine relevant cells into relevant regions (based on grid-neighborhood)
and return the so obtained clusters as your answers.
STING: A Statistical Information Grid
Approach (3)
– Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
– Disadvantages:
• All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected