
Cluster Analysis

Concept & Methods


Cluster Analysis
• Cluster analysis is a multivariate approach for grouping observations based on similarity among measured variables.
• It is an important tool for identifying market segments.
• It classifies individuals or objects into a small number of mutually exclusive and exhaustive groups.
• Objects or individuals are assigned to groups so that there is great similarity within groups and much less similarity between groups.
• Clusters should therefore have high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity.
EXHIBIT 24.7 Clusters of Individuals on Two Dimensions
Distance measures for individual observations
• To measure similarity between two observations, a distance measure is needed.
• With a single variable, similarity is straightforward.
  • Example: income – two individuals are similar if their income levels are similar, and the dissimilarity increases as the income gap widens.
• With multiple variables, an aggregate distance measure is required.
  • With many characteristics (e.g. income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level, ...), it becomes more difficult to define similarity with a single value.
• The best-known measure of distance is the Euclidean distance, the concept we use in everyday life for spatial coordinates.
Model:

Data: each object is characterized by a set of n measurements, e.g.,
  object 1: $(x_{11}, x_{12}, \ldots, x_{1n})$
  object 2: $(x_{21}, x_{22}, \ldots, x_{2n})$
  ...
  object p: $(x_{p1}, x_{p2}, \ldots, x_{pn})$

Distance: the Euclidean distance $d_{ij}$ between objects $i$ and $j$,

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$$
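As a minimal sketch of how this formula is computed in practice (assuming Python with NumPy; the two observations and their variable values are hypothetical):

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two observations (1-D arrays of equal length)."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

# Two hypothetical respondents measured on income (thousands) and age:
print(euclidean_distance([45.0, 31.0], [52.0, 38.0]))  # ~9.90
```

Note that when variables are on very different scales (e.g. income in thousands vs. age in years), they are usually standardized before computing distances, so that no single variable dominates the aggregate measure.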
Three Cluster Diagram Showing
Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize


Within-Cluster Variation = Minimize

Hierarchical clustering
• Agglomerative:
  • Each of the n observations constitutes a separate cluster
  • The two clusters that are most similar according to some distance rule are merged, so that after step 1 there are n-1 clusters
  • In the second step another merge takes place (n-2 clusters), joining the two clusters that are most similar, and so on
  • One merge occurs at each step until all observations end up in a single cluster in the final step
• Divisive:
  • All observations are initially assumed to belong to a single cluster
  • The most dissimilar observation(s) is extracted to form a separate cluster
  • After step 1 there are 2 clusters, after the second step three clusters, and so on, until the final step produces as many clusters as there are observations
• The desired number of clusters determines the stopping rule for the algorithms (see the sketch below)
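A minimal agglomerative example, assuming Python with SciPy (the toy data matrix is hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: six observations measured on two variables.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 1.8],
              [8.0, 8.0], [8.5, 8.5], [9.0, 8.2]])

# Agglomerative clustering: each row starts as its own cluster and the
# two closest clusters are merged at every step (n-1 merges in total).
Z = linkage(X, method='average', metric='euclidean')

# Cut the hierarchy at a chosen number of clusters (the stopping rule).
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
```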
Non-hierarchical clustering
• These algorithms do not follow a hierarchy and produce a single partition
• The number of clusters (c) must be known in advance
• In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software
• Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres
• Cluster centres are then recomputed, and observations may be reallocated to the nearest centre in the next iteration
• The process stops when no observations can be reallocated or a stopping rule is met
Distance between clusters
• Algorithms vary according to the way the distance between two clusters is defined.
• The most common algorithms for hierarchical methods include:
  • centroid method
  • single linkage method
  • complete linkage method
  • average linkage method
  • Ward algorithm
Linkage methods
• Single linkage method (nearest neighbour): the distance between two clusters is the minimum of all possible distances between observations belonging to the two clusters.
• Complete linkage method (furthest neighbour): merges two clusters using as a basis the maximum distance between observations belonging to the separate clusters.
• Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters.
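The linkage rules differ only in how cluster-to-cluster distance is defined, so they can be compared on the same data. A sketch assuming SciPy, where each rule is selected through the `method` argument of `linkage` (the toy data are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 8.5]])

# Same data, different definitions of cluster-to-cluster distance.
for method in ('single',     # nearest neighbour: minimum pairwise distance
               'complete',   # furthest neighbour: maximum pairwise distance
               'average',    # mean of all pairwise distances
               'centroid',   # distance between cluster centroids
               'ward'):      # smallest increase in within-cluster SS
    Z = linkage(X, method=method)
    print(method, Z[-1, 2])  # distance at which the last two clusters merge
```

Single linkage tends to produce elongated, "chained" clusters, while complete linkage favours compact ones; average linkage is a compromise between the two.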
Ward algorithm

1. The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster.
2. The algorithm proceeds by choosing the merge of two clusters which generates the smallest increase in the total sum of squared distances.

• It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible merge of clusters (see the sketch below).
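A minimal sketch of the Ward criterion, assuming Python with NumPy. It uses the common equivalent formulation based on squared distances to the cluster centroid (the function names and toy clusters are hypothetical):

```python
import numpy as np

def within_ss(cluster):
    """Sum of squared Euclidean distances from each observation to the cluster centroid."""
    cluster = np.asarray(cluster, dtype=float)
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

def ward_increase(a, b):
    """Increase in total within-cluster SS if clusters a and b were merged."""
    return within_ss(np.vstack([a, b])) - within_ss(a) - within_ss(b)

# At each step, Ward merges the pair of clusters with the smallest increase.
a = np.array([[1.0, 1.0], [1.5, 2.0]])
b = np.array([[8.0, 8.0], [8.5, 8.5]])
print(ward_increase(a, b))  # large increase: this pair is a poor merge candidate
```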
Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed in advance
2. An initial set of k "seeds" (aggregation centres) is provided, e.g. the first k elements
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary

• Units can be reassigned in successive steps (optimising partitioning); a minimal implementation sketch follows
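A minimal sketch of these steps, assuming Python with NumPy (the function name and toy data are hypothetical; empty clusters are not handled):

```python
import numpy as np

def k_means(X, k, max_iter=100):
    """Minimal k-means sketch: first k rows as seeds, Euclidean assignment."""
    X = np.asarray(X, dtype=float)
    seeds = X[:k].copy()                      # step 2: first k elements as seeds
    for _ in range(max_iter):
        # step 3: assign every unit to the nearest seed
        dist = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 4: recompute seeds as the means of their clusters
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 5: stop when no seed moves, i.e. no reclassification happens
        if np.allclose(new_seeds, seeds):
            break
        seeds = new_seeds
    return labels, seeds

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 8.1]])
labels, centres = k_means(X, k=2)
print(labels)  # the two low-valued rows share one label, the two high-valued rows the other
```

Because the result depends on the initial seeds, software typically reruns the algorithm from several starting points and keeps the best partition.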
Hierarchical vs. non-hierarchical methods

Hierarchical methods:
• No decision about the number of clusters is required
• Problems when data contain a high level of error
• Can be very slow; preferable with small data sets
• At each step they require computation of the full proximity matrix

Non-hierarchical methods:
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration
How many clusters?

There are no hard and fast rules; common criteria include:
a. theoretical, conceptual, or practical considerations;
b. the distances at which clusters are combined in a hierarchical clustering (see the sketch below);
c. the relative size of the clusters should be meaningful;
etc.
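One way to operationalise criterion (b), assuming SciPy: the linkage matrix records the distance at which each merge occurs, and a sudden jump suggests that two genuinely different clusters were forced together, so stopping just before the jump is a common heuristic (the toy data are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 1.0], [1.3, 0.9], [5.0, 5.2],
              [5.1, 4.8], [9.0, 9.1], [9.2, 8.8]])

Z = linkage(X, method='ward')
# Column 2 holds the merge distances, from the first merge to the last;
# look for the first large jump when choosing where to cut the hierarchy.
print(Z[:, 2])
```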
