
Machine Learning

Concepts and Techniques

Clustering
More Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation database
 Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
 City planning: Identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Recapitulation: Clustering
• The goal of clustering is to
– group data points that are close (or similar) to each other
– identify such groupings (or clusters) in an unsupervised manner
• Unsupervised: no information is provided to the algorithm on which
data points belong to which clusters
• Example: what should the clusters be for these data points?
[Figure: a scatter of unlabeled points marked ×]
What is Clustering?
• Clustering can be considered the most important unsupervised learning problem: like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.

• A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.

• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Clustering Algorithms
 A clustering algorithm attempts to find natural groups of components (or data) based on some similarity
 Also, the clustering algorithm finds the centroid of each group of data points
 To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids
 The output from a clustering algorithm is basically a statistical description of
the cluster centroids with the number of components in each cluster.
• Simple graphical example:
 In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.
 Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if they are described by a concept common to all of them.
 In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
Quality: What Is Good Clustering?
 A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
 The quality of a clustering result depends on both the similarity measure used by the method and its implementation
 The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a
cluster.
• The definitions of distance functions are usually very different for interval-
scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications
and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Typical Alternatives to Calculate the Distance between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min{dist(tip, tjq)}
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max{dist(tip, tjq)}
 Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg{dist(tip, tjq)}
 Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
• Centroid: the “middle” of a cluster
  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
• Radius: square root of average distance from any point of the cluster to its centroid
  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
• Diameter: square root of average mean squared distance between all pairs of points in the cluster
  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters so as to minimize the sum of squared distances
  $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented by the center of
the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw,
1987): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
– Step 1: Partition objects into k nonempty subsets
– Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
– Step 3: Assign each object to the cluster with the nearest seed point
– Step 4: Go back to Step 2; stop when there are no more new assignments
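The four steps above map directly onto a short implementation. The following is a minimal NumPy sketch, not the slide author's code; the random initial partition and the Euclidean distance are assumptions:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: partition objects into k nonempty subsets (here: random initial labels)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute seed points as the centroids of the current partition
        # (assumes no cluster becomes empty; real code would reseed empty clusters)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: repeat until no assignment changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```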


The K-Means Clustering Method
• Example (illustrated in the original slides as a sequence of scatter plots on a 10 × 10 grid):
– K = 2; arbitrarily choose K objects as the initial cluster centers
– Assign each object to the most similar (nearest) center
– Update the cluster means
– Reassign objects and update the means again; repeat until assignments no longer change
Variations of the K-Means Method
• A few variants of k-means differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means

• Handling categorical data: k-modes (Huang’98)


– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
– A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means
Method?
• The k-means algorithm is sensitive to outliers !
– An object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two 10 × 10 scatter plots illustrating the effect of an outlier on the cluster representative]
Example
• Height and weight information are given. Using these two variables, we need to group the objects.
Data Sample
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Step 1: Input
Dataset, clustering variables and maximum number of clusters (K in K-Means clustering)
In this dataset, only two variables – height and weight – are considered for clustering
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Step 2: Initialize cluster centroids
In this example, the value of K is taken as 2. Cluster centroids are initialized with the first 2 observations.
Initial Centroid
Cluster Height Weight
K1 185 72
K2 170 56
Step 3: Calculate Euclidean Distance
Euclidean distance is one of the distance measures used in the K-Means algorithm. The Euclidean distance between each observation and the initial cluster centroids 1 and 2 is calculated. Based on the Euclidean distance, each observation is assigned to one of the clusters (whichever is at minimum distance).
Euclidean Distance
First two observations
Height Weight
185 72
170 56
Now initial cluster centroids are :
Updated Centroid
Cluster Height Weight
K1 185 72
K2 170 56
The Euclidean distance from each of the cluster centroids is calculated.
Euclidean Distance from Cluster 1 | Euclidean Distance from Cluster 2 | Assignment
√((185-185)² + (72-72)²) = 0 | √((185-170)² + (72-56)²) = 21.93 | 1
√((170-185)² + (56-72)²) = 21.93 | √((170-170)² + (56-56)²) = 0 | 2

We considered only these first two observations because their assignments are already known: they were chosen as the initial centroids, so the centroids do not change.
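For reference, a tiny Python helper (illustrative, not part of the slides) reproduces these distance calculations for the next observation:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two (height, weight) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

centroids = {1: (185, 72), 2: (170, 56)}   # initial centroids K1, K2
obs = (168, 60)                            # next observation (Step 4)
dists = {k: euclidean(obs, c) for k, c in centroids.items()}
print(dists)                       # {1: 20.808..., 2: 4.472...}
print(min(dists, key=dists.get))   # assigned to cluster 2
```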
Step 4: Move on to next observation and calculate Euclidean Distance
Height Weight
168 60
Euclidean Distance from Cluster 1 | Euclidean Distance from Cluster 2 | Assignment
√((168-185)² + (60-72)²) = 20.808 | √((168-170)² + (60-56)²) = 4.472 | 2

Since the distance from cluster 2 is the minimum, the observation is assigned to cluster 2.
Now revise the cluster centroids, using the mean Height and Weight of the assigned observations as the cluster centroids. The addition was only to cluster 2, so only the centroid of cluster 2 is updated.
Updated cluster centroids
Updated Centroid
Cluster Height Weight
K=1 185 72
K=2 (170+168)/2 = 169 (56+60)/2 = 58
Step 5: Calculate the Euclidean distance for the next observation, assign it to a cluster based on the minimum Euclidean distance, and update the cluster centroids.
Next Observation.
Height Weight
179 68
Euclidean Distance Calculation and Assignment
Euclidean Distance from Cluster 1 | Euclidean Distance from Cluster 2 | Assignment
7.211 | 14.142 | 1
Update Cluster Centroid
Updated Centroid
Cluster | Height | Weight
K=1 | (185+179)/2 = 182 | (72+68)/2 = 70
K=2 | 169 | 58
Continue the steps until all observations are assigned
Cluster Centroids
Updated
Centroid
Cluster Height Weight
K=1 182.8 72
K=2 169 58
This matches what was expected initially based on a two-dimensional plot of the data.
A few important considerations in K-Means
• The scale of measurements influences the Euclidean distance, so variable standardisation becomes necessary
• Depending on expectations, you may require outlier treatment
• K-Means clustering may be biased by the initial centroids, called cluster seeds
• The maximum number of clusters is typically an input and also impacts the clusters that get created
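These considerations can be illustrated with a short, hedged sketch that standardises the height/weight data from the worked example and runs k-means with k = 2; the use of scikit-learn and the parameter choices are assumptions, not part of the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Height/weight data from the worked example
X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]],
             dtype=float)

# Standardise so that height (cm) and weight (kg) contribute comparably to the distance
X_std = StandardScaler().fit_transform(X)

# n_init and random_state control sensitivity to the initial centroids ("cluster seeds")
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)
print(km.labels_)           # cluster assignment of each observation
print(km.cluster_centers_)  # centroids in standardised units
```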
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
Confusion Matrix
[Figure: an example confusion matrix with 10 classes]
ROC CURVE and AUC
An ROC curve is a commonly used way to visualize the performance of a
binary classifier, meaning a classifier with two possible output classes.
• For example, suppose you built a classifier to predict whether a research paper will be accepted by a journal, based on a variety of factors. The features might be the length of the paper, the number of authors, the number of papers those authors have previously submitted to the journal, et cetera. The response (or "output variable") would be whether or not the paper was accepted.
Confusion Matrix (contd.)
Measuring Accuracy using the Confusion Matrix
[Figures: the slides illustrate the confusion matrix layout and how accuracy is computed from it]
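As an illustrative sketch of these metrics (the use of scikit-learn and the toy labels/scores below are assumptions, not taken from the slides):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

# Toy binary problem: 1 = paper accepted, 0 = rejected
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(roc_auc_score(y_true, y_score))    # area under the ROC curve
```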
A Typical K-Medoids Algorithm (PAM)
K = 2, Total Cost = 20
– Arbitrarily choose k objects as the initial medoids
– Assign each remaining object to the nearest medoid
– Do loop, until no change:
  – Randomly select a non-medoid object, Orandom
  – Compute the total cost of swapping a medoid with Orandom (here Total Cost = 26)
  – Swap the medoid with Orandom if the quality is improved
[Figure: the slides illustrate these steps on 10 × 10 scatter plots]
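A compact sketch of the PAM swap loop described above (illustrative Python assuming Euclidean distance and a small data set; this is not the original PAM implementation):

```python
import numpy as np
from itertools import product

def total_cost(X, medoid_idx):
    """Sum of distances from each object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                              # "do loop until no change"
        improved = False
        for m_pos, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o                 # try swapping one medoid with a non-medoid
            cost = total_cost(X, candidate)
            if cost < best:                      # keep the swap only if quality improves
                medoids, best, improved = candidate, cost, True
    return medoids, best
```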
Hierarchical Clustering
• Clusters are created in levels actually creating sets of clusters at each
level.
• Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
• Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition
 Illustrative example: agglomerative and divisive clustering on the data set {a, b, c, d, e}
[Figure: agglomerative clustering proceeds from step 0 to step 4, merging a and b into (a, b), d and e into (d, e), then (d, e) with c into (c, d, e), and finally everything into (a, b, c, d, e); divisive clustering runs the same hierarchy in reverse, from step 4 back to step 0]
 Key choices: the cluster distance measure and the termination condition
Hierarchical Agglomerative Clustering
(HAC)
• Starts with each doc in a separate cluster
– then repeatedly joins the closest pair of clusters, until there is only one
cluster.
• The history of merging forms a binary tree or hierarchy.
How do we measure the distance between clusters?
Closest pair of clusters
Many variants exist for defining the closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “farthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
Cluster Distance Measures
• Single link (min): smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
• Complete link (max): largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
• Average: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
• The distance of a cluster to itself is zero: d(C, C) = 0
[Figure: the slide illustrates the single-link, complete-link and average distances between two clusters]
Dendrogram
• Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
• Each level shows clusters for
that level.
– Leaf – individual clusters
– Root – one cluster
• A cluster at level i is the union of
its children clusters at level i+1.
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature,
assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

a b c d e
Feature 1 2 4 5 6

1. Calculate the distance matrix.
2. Calculate the three cluster distances between C1 and C2.

Distance matrix:
  a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0

Single link:
dist(C1, C2) = min{d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)} = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
dist(C1, C2) = max{d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)} = max{3, 4, 5, 2, 3, 4} = 5

Average:
dist(C1, C2) = (d(a, c) + d(a, d) + d(a, e) + d(b, c) + d(b, d) + d(b, e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21 / 6 = 3.5
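A short check of these three numbers (an illustrative Python sketch; the cluster membership follows the example above):

```python
import itertools

feature = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

# All pairwise distances between an element of C1 and an element of C2
pair_dists = [abs(feature[p] - feature[q]) for p, q in itertools.product(C1, C2)]

print(min(pair_dists))                     # single link   -> 2
print(max(pair_dists))                     # complete link -> 5
print(sum(pair_dists) / len(pair_dists))   # average       -> 3.5
```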
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1) Convert all object features into a
distance matrix
2) Set each object as a cluster (thus if
we have N objects, we will have N
clusters at the beginning)
3) Repeat until the number of clusters is one (or a known # of clusters)
 Merge two closest clusters
 Update “distance matrix”
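The three steps translate into a short single-link sketch (illustrative Python; the 1-D toy data, the tie-breaking and the single-link choice are assumptions, not the slide author's code):

```python
import itertools

def single_link_agglomerative(points, target_k=1):
    """points: dict name -> 1-D feature value; merge until target_k clusters remain."""
    # Steps 1-2: pairwise distances from the features; one cluster per object to start
    clusters = [[name] for name in points]
    dist = lambda p, q: abs(points[p] - points[q])
    cluster_dist = lambda A, B: min(dist(p, q) for p in A for q in B)  # single link

    merges = []
    while len(clusters) > target_k:
        # Step 3: merge the two closest clusters and record the merge distance
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], cluster_dist(clusters[i], clusters[j])))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [clusters[i] + clusters[j]]
    return clusters, merges

_, merges = single_link_agglomerative({'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6})
print(merges)   # merge history with single-link merge distances
```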
Example
• Problem: clustering analysis with agglomerative algorithm

[Figure: the data matrix is converted into a distance matrix using Euclidean distance]
Example
• Merge two closest clusters (iteration 1)
Example
• Update distance matrix (iteration 1)
Example
• Merge two closest clusters (iteration 2)
Example
• Update distance matrix (iteration 2)
Example
• Merge two closest clusters/update distance matrix (iteration 3)
Example
• Merge two closest clusters/update distance matrix (iteration 4)
Example
• Final result (meeting termination condition)
Example
• Dendrogram tree representation (the vertical axis shows the lifetime, i.e., the merge distance; the horizontal axis shows the objects)
1. There are 6 clusters: A, B, C, D, E and F
2. Merge clusters D and F into cluster (D, F) at distance 0.50
3. Merge cluster A and cluster B into (A, B) at distance 0.71
4. Merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. Merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. Merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects; this concludes the computation
Exercise
Given a data set of five objects characterised by a single continuous feature:
        a b c d e
Feature 1 2 4 5 6

Apply the agglomerative algorithm with single-link, complete-link and averaging cluster
distance measures to produce three dendrogram trees, respectively.
a b c d e

a 0 1 3 4 5

b 1 0 2 3 4

c 3 2 0 1 2

d 4 3 1 0 1

e 5 4 2 1 0
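One way to check your dendrograms (a hedged sketch assuming SciPy and Matplotlib are available; they are not referenced in the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1], [2], [4], [5], [6]], dtype=float)   # features of a, b, c, d, e
labels = ['a', 'b', 'c', 'd', 'e']

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, method in zip(axes, ['single', 'complete', 'average']):
    Z = linkage(X, method=method)          # agglomerative merge history
    dendrogram(Z, labels=labels, ax=ax)
    ax.set_title(f'{method}-link')
plt.tight_layout()
plt.show()
```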
Density-Based Clustering Algorithms
Density-Based Clustering
• Clustering based on density (local cluster criterion), such as density-
connected points or based on an explicitly constructed density
function
• A cluster is a connected dense component which can grow in any direction that density leads
• Based on density, connectivity and boundary
• Finds arbitrarily shaped clusters with good scalability
• Each cluster has a considerably higher density of points than the area outside of the cluster
Major Features
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
Two Major Types of Density-Based
Clustering Algorithms
• Connectivity based:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– CLIQUE: Agrawal, et al. (SIGMOD’98)

• Density function based:


- DENCLUE: Hinneburg & D. Keim (KDD’98/2006)
Density Based Clustering: Basic Concept
• Intuition for the formalization of the basic idea
– For any point in a cluster, the local point density around that point has to
exceed some threshold
– The set of points from one cluster is connected
• Local point density at a point p defined by two parameters
– ε – radius for the neighborhood of point p:
Nε(p) := {q in data set D | dist(p, q) ≤ ε}
– MinPts – minimum number of points in the given neighbourhood Nε(p)
ε-Neighborhood
• ε-Neighborhood – objects within a radius of ε from an object, Nε(p) := {q | d(p, q) ≤ ε}
• “High density” – the ε-neighborhood of an object contains at least MinPts objects
[Figure: the ε-neighborhoods of two points p and q; the density of p is “high” and the density of q is “low” for MinPts = 4]
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups.
 A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are in the interior of a cluster.
 A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
 A noise point is any point that is neither a core point nor a border point.
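A small sketch of this categorization (illustrative Python; the brute-force neighborhood query and Euclidean distance are assumptions, not part of the slides):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point in the (n, d) array X as 'core', 'border' or 'noise'."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # eps-neighborhoods (each point is counted as its own neighbor)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append('core')
        elif any(j in core for j in neighbors[i]):
            labels.append('border')   # not dense itself, but near a core point
        else:
            labels.append('noise')
    return labels
```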
Example
• M, P, O, and R are core objects since each is in an Eps neighborhood
containing at least 3 points

[Figure: MinPts = 3; Eps is the radius of the circles drawn around each core object]
Density-Reachability
 Directly density-reachable
 An object q is directly density-reachable from object p if p is a core object and q is in p’s ε-neighborhood.
[Figure: MinPts = 5, Eps = 1 cm]
 q is directly density-reachable from p
 p is not directly density-reachable from q
 Density-reachability is asymmetric.
Density-Reachability
• Density-Reachable (directly and indirectly):
– A point p is directly density-reachable from p1;
– p1 is directly density-reachable from q;
– q → p1 → p form a chain.
• p is (indirectly) density-reachable from q
• q is not density-reachable from p
• Density-connected
– A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: chains illustrating density-reachability (q → p1 → p) and density-connectivity (p and q both density-reachable from o)]
Formal Description of Cluster
• Given a data set D, parameter ε and threshold MinPts.
• A cluster C is a subset of objects satisfying two criteria:
– Connected: ∀ p, q ∈ C: p and q are density-connected.
– Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C. (avoid redundancy)
[Figure: P is a core object.]
Review of Concepts
• Is an object o in a cluster or an outlier?
– Is o a core object?
– Is o density-reachable by some core object?
– Directly density-reachable, or indirectly density-reachable through a chain
• Are objects p and q in the same cluster?
– Are p and q density-connected?
– Are p and q density-reachable by some object o?
DBSCAN Algorithm
Input: The data set D
Parameter: ε, MinPts
For each object p in D
if p is a core object and not processed then
C = retrieve all objects density-reachable from p
mark all objects in C as processed
report C as a cluster
else mark p as outlier
end if
End For

DBSCAN: The Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p wrt Eps and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.
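For reference, a hedged usage sketch with scikit-learn's DBSCAN (the library call and the toy data are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should come out as noise
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],
              [5, 5], [5.1, 5.2], [4.9, 5.0], [5.2, 4.9],
              [9, 1]], dtype=float)

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)                # cluster ids; -1 marks noise points
print(db.core_sample_indices_)   # indices of core points
```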
DBSCAN Algorithm: Example
• Parameters: ε = 2 cm, MinPts = 3

for each o ∈ D do
  if o is not yet classified then
    if o is a core-object then
      collect all objects density-reachable from o
      and assign them to a new cluster
    else
      assign o to NOISE

[Figure: the slides step through this example on a sample point set in three stages]
DBSCAN Algorithm: Advantages
• DBSCAN does not require the number of clusters in the data to be specified a priori, as opposed to k-means.
• DBSCAN can find arbitrarily shaped clusters. It can even find a cluster
completely surrounded by (but not connected to) a different cluster. Due to
the MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
• DBSCAN has a notion of noise, and is robust to outliers.
• DBSCAN requires just two parameters and is mostly insensitive to the
ordering of the points in the database. (However, points sitting on the edge of
two different clusters might swap cluster membership if the ordering of the
points is changed, and the cluster assignment is unique only up to
isomorphism.)
• The parameters minPts and ε can be set by a domain expert, if the data is well
understood.
DBSCAN Algorithm: Disadvantages
• DBSCAN is not entirely deterministic: border points that are reachable from
more than one cluster can be part of either cluster, depending on the order the
data is processed. Fortunately, this situation does not arise often, and has little
impact on the clustering result: both on core points and noise points,
DBSCAN is deterministic.
• The quality of DBSCAN depends on the distance measure used in the
function regionQuery (P, ε). The most common distance metric used
is Euclidean distance. Especially for high-dimensional data, this metric can be
rendered almost useless due to the so-called "Curse of dimensionality",
making it difficult to find an appropriate value for ε. This effect, however, is
also present in any other algorithm based on Euclidean distance.
• DBSCAN cannot cluster data sets well with large differences in densities,
since the minPts-ε combination cannot then be chosen appropriately for all
clusters.
• If the data and scale are not well understood, choosing a meaningful distance
threshold ε can be difficult.
Steps of Grid-based Clustering
Algorithms
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density
of each cell.
3. Eliminate cells whose density is below a certain threshold.
4. Form clusters from contiguous (adjacent) groups of dense cells
(usually minimizing a given objective function)
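A minimal sketch of these four steps (illustrative Python; the cell size, the density threshold and the use of 4-connectivity for “contiguous” are assumptions, not part of the slides):

```python
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell_size=1.0, density_threshold=2):
    """Basic grid-based clustering of 2-D points X (an (n, 2) array)."""
    # Steps 1-2: assign objects to grid cells and compute each cell's density
    cells = defaultdict(list)
    for i, (x, y) in enumerate(X):
        cells[(int(x // cell_size), int(y // cell_size))].append(i)
    # Step 3: eliminate cells below the density threshold
    dense = {c for c, members in cells.items() if len(members) >= density_threshold}
    # Step 4: form clusters from contiguous (4-adjacent) dense cells via flood fill
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            group.extend(cells[(cx, cy)])
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(group)
    return clusters   # lists of point indices, one list per cluster
```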
Advantages of Grid-based Clustering Algorithms
• fast:
– No distance computations
– Clustering is performed on summaries and not individual objects;
complexity is usually O(#-populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
• Shapes are limited to union of grid-cells
Grid-Based Clustering Methods
• Grid-based methods quantize the object space into a finite number of cells that form a grid structure (using a multi-resolution grid data structure).
• All the clustering operations are performed on the grid structure.
• Clustering complexity depends on the number of populated grid cells and
not on the number of objects in the dataset
• Several interesting methods (in addition to the basic grid-based algorithm)
– STING (a STatistical INformation Grid approach) by Wang, Yang and
Muntz (1997)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid
Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of
resolution
STING: A Statistical Information Grid
Approach (2)
– Each cell at a high level is partitioned into a number of smaller cells in the
next lower level
– Statistical info of each cell is calculated and stored beforehand and is
used to answer queries
– Parameters of higher level cells can be easily calculated from parameters
of lower level cell
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
STING: Query Processing(3)
Used a top-down approach to answer spatial data queries
1.Start from a pre-selected layer—typically with a small number of cells
2.From the pre-selected layer until you reach the bottom layer do the following:
• For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
– If it is relevant, include the cell in a cluster
– If it is irrelevant, remove the cell from further consideration
– otherwise, look for relevant cells at the next lower layer
3.Combine relevant cells into relevant regions (based on grid-neighborhood)
and return the so obtained clusters as your answers.
STING: A Statistical Information Grid
Approach (3)
– Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
– Disadvantages:
• All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected
