Cluster Analysis
LEARNING OBJECTIVES
A Simple Example
Objective Versus Subjective Considerations
Scatter Diagram for Cluster Observations
[Figure: alternative scatter diagrams of the example observations, plotted on frequency of eating out, each showing a different grouping. Which one is correct?]
Cluster analysis has come under criticism for two key elements:
2. Practical considerations.
Should always use the “best” variables available (i.e., little measurement error, etc.).
Rules of Thumb – Objectives of Cluster Analysis
Interobject similarity
• an empirical measure of correspondence, or resemblance, between objects to be
clustered.
• calculated across the entire set of clustering variables to allow for the grouping of
observations and their comparison to each other.
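To make the idea concrete, here is a minimal Python sketch (using SciPy, with entirely hypothetical data values) of computing interobject similarity as squared Euclidean distance across all clustering variables at once:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 5 respondents measured on 3 clustering variables
X = np.array([
    [7, 5, 6],
    [6, 5, 7],
    [2, 8, 3],
    [3, 7, 2],
    [7, 6, 6],
])

# Squared Euclidean distance computed across ALL clustering variables at once;
# smaller values indicate greater interobject similarity
dist = squareform(pdist(X, metric="sqeuclidean"))
print(np.round(dist, 1))
```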
Three Assumptions Underlying Cluster Analysis
• Structure exists.
• Representativeness of the sample.
• Impact of multicollinearity.
Structure Exists
• Since cluster analysis will always generate a solution, the researcher must assume that a
“natural” structure of objects exists which the technique is meant to identify.
Representativeness of the Sample
• Must be confident that the obtained sample is truly representative of the population.
Impact of Multicollinearity
• Multicollinearity among subsets of variables acts as an implicit “weighting” of the
clustering variables
• Potential remedies for multicollinear subsets of variables
• Reduce the variables to equal numbers in each set of correlated measures.
• Use a distance measure that compensates for the correlation, such as Mahalanobis distance (see the sketch after this list).
• Take a proactive approach and include only cluster variables that are not highly correlated.
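A minimal sketch of the Mahalanobis remedy, using SciPy rather than the SPSS/SAS procedures discussed in the chapter; the data are simulated purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Hypothetical data in which the first two clustering variables are highly correlated
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + rng.normal(scale=0.2, size=50), rng.normal(size=50)])

VI = np.linalg.inv(np.cov(X, rowvar=False))          # inverse covariance matrix
d_mahal = squareform(pdist(X, metric="mahalanobis", VI=VI))
d_euclid = squareform(pdist(X, metric="euclidean"))
# Mahalanobis distances adjust for the correlation between the first two variables,
# whereas Euclidean distances implicitly double-count that shared information
```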
STAGE 4: DERIVING CLUSTERS AND ASSESSING OVERALL FIT
Selecting the partitioning procedure.
Potentially respecify initial cluster solutions by eliminating outliers or small clusters.
Determining the number of clusters.
Other clustering approaches.
Two Approaches to Partitioning Observations
Hierarchical
• The most common approach: all objects start as separate clusters and are then joined
sequentially, two clusters at a time, with each step forming a new cluster, until only a
single cluster remains.
Non-hierarchical
• The number of clusters is specified by the analyst, and the objects are then assigned to
that number of groupings.
Two Types of Hierarchical Clustering Procedures
Agglomerative Methods
• Buildup: all observations start as individual clusters and are joined together sequentially.
Divisive Methods
• Breakdown: all observations initially form a single cluster, which is then divided into smaller clusters.
How Do Agglomerative Hierarchical Approaches Work?
A multi-step process
• Start with all observations as their own cluster.
• Using the selected similarity measure and agglomerative algorithm,
combine the two most similar observations into a new cluster, now
containing two observations.
• Repeat the clustering procedure using the similarity
measure/agglomerative algorithm to combine the two most similar
observations or clusters (i.e., combinations of observations) into
another new cluster.
• Continue the process until all observations are in a single cluster.
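A minimal sketch of this build-up process using SciPy's hierarchical clustering routines; the data are simulated, and the choice of average linkage with Euclidean distance is only illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])  # hypothetical data

# Each row of Z records one agglomeration step: the two clusters joined,
# the distance (similarity) at which they were joined, and the new cluster's size
Z = linkage(X, method="average", metric="euclidean")

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at a two-cluster solution
tree = dendrogram(Z, no_plot=True)                # tree-like portrayal of the full sequence of joins
print(labels)
```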
Clustering Algorithms -- Agglomerative
Provides a method for combining clusters that have more than one observation.
• Single linkage: similarity based only on the two closest observations.
• Complete linkage: similarity based only on the two farthest observations.
Comparing the Agglomerative Algorithms
Single linkage
• probably the most versatile algorithm, but poorly delineated cluster structures
within the data produce unacceptable snakelike “chains” for clusters.
Complete linkage
• eliminates the chaining problem, but only considers the outermost observations in
a cluster, thus impacted by outliers.
Average linkage
• generates clusters with small within-cluster variation and is less affected by outliers.
Centroid linkage
• like average linkage, is less affected by outliers.
Ward’s method
• most appropriate when the researcher expects somewhat equally sized clusters, but is
easily distorted by outliers.
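Because these linkage criteria are all available in SciPy, their behavior can be compared on the same (simulated) data; this sketch simply reports the cluster sizes each algorithm produces for a two-cluster solution:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])  # hypothetical data

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)        # centroid and ward assume Euclidean distance
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes for the two-cluster solution
```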
How Do Nonhierarchical Approaches Work?
1. Determine number of clusters to be extracted
2. Specify cluster seeds.
• Researcher specified.
• Sample generated:
• SAS FASTCLUS = first cluster seed is first observation in data set with no missing values.
• SPSS QUICK CLUSTER = seed points are selected randomly from all observations.
3. Assign each observation to one of the seeds based on similarity.
• Sequential Threshold = selects one seed point and develops its cluster; then selects the
next seed point and develops its cluster, and so on. Observations cannot be re-assigned
to another cluster following their original assignment.
• Parallel Threshold = sets all seed points simultaneously, then develops clusters.
• Optimization = allows for re-assignment of observations based on the sequential
proximity of observations to clusters formed during the clustering process.
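A minimal sketch of a nonhierarchical (k-means type) analysis with researcher-specified seed points, using scikit-learn rather than FASTCLUS or QUICK CLUSTER; the seeds and data are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(5, 1, (40, 2))])  # hypothetical data

# Researcher-specified seed points (one per cluster); with n_init=1 these seeds
# are used as given rather than being re-drawn at random
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])
km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)

print(km.labels_[:10])        # cluster assignment for each observation
print(km.cluster_centers_)    # final cluster centroids after re-assignment/optimization
```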
Pros and Cons of Hierarchical Methods
Pros
• Simplicity – generates a tree-like structure that provides a simple portrayal of the process.
• Measures of similarity – multiple measures to address many situations.
• Speed – generates the entire set of cluster solutions in a single analysis.
Cons
• Permanent combinations – once joined, clusters are never separated.
• Impact of outliers – outliers may appear as single object or very small clusters.
• Large samples – not amenable to very large samples; analyzing large populations may
require drawing a sample of observations.
Pros and Cons of Nonhierarchical Methods
Pros
• Results are less susceptible to:
• outliers in the data,
• the distance measure used, and
• the inclusion of irrelevant or inappropriate variables.
• Can easily analyze very large data sets
Cons
• Best results require knowledge of seed points.
• Difficult to guarantee optimal solution.
• Typically generates only spherical and roughly equally sized clusters.
• Less efficient in examining a wide range of cluster solutions.
Selecting Between Hierarchical and Nonhierarchical
Primary focus
• Identification of single-object or very small clusters that represent disparate
observations not matching the research objectives.
• Considerations are similar to those for outlier identification and often involve the
same conditions in the sample.
If respecification occurs
• Should reanalyze remaining data, especially when using hierarchical procedures.
Determining the Number of Clusters
Stopping rules
• Criteria used with hierarchical techniques to identify potential cluster solutions.
• Foundational principle – a natural increase in heterogeneity comes from the
reduction in number of clusters.
• Common to all stopping rules:
• evaluating the trend in heterogeneity across cluster solutions to identify marked increases.
• substantive increases in this trend indicate relatively distinct clusters were joined and that the
cluster structure before joining is a potential candidate for the final solution.
• Issues in applying stopping rules
• The ad hoc procedures must be computed by the researcher and often involve fairly complex
approaches.
• Measures are often software-specific (one common heterogeneity-based measure is sketched below).
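As one example of a heterogeneity-based stopping rule, this sketch (SciPy, simulated data) computes the percentage increase in the agglomeration coefficient across successive hierarchical solutions; a marked spike suggests that two relatively distinct clusters were just joined:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (25, 3)),
               rng.normal(4, 1, (25, 3)),
               rng.normal(8, 1, (25, 3))])       # hypothetical data with three groups

Z = linkage(X, method="ward")
heights = Z[:, 2]                                # agglomeration coefficient at each join

# Percentage change in heterogeneity when moving from k+1 clusters to k clusters
for k in range(10, 1, -1):
    h_now, h_prev = heights[-k], heights[-(k + 1)]
    pct = 100 * (h_now - h_prev) / h_prev
    print(f"{k + 1} -> {k} clusters: +{pct:.1f}%")
```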
Two Classes of Stopping Rules
Density-based approach
• Fundamental principle – clusters can be identified by “dense” clusters of objects
within the sample, separated by regions of lower object density.
• Researcher must decide
• ε, the radius around a point that defines a point’s neighborhood, and
• the minimum number of objects (minObj) necessary within a neighborhood to define it as a cluster (see the sketch after this list).
• Has advantages of
• Ability to identify clusters of any arbitrary shape
• Ability to process very large samples,
• Requires specification of only two parameters,
• No prior knowledge of number of clusters,
• Explicit designation of outliers as separate from objects assigned to clusters,
• Applicable to a “mixed” set of clustering variables (i.e., both metric and nonmetric).
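A minimal sketch of a density-based analysis using scikit-learn's DBSCAN, whose eps and min_samples arguments play the roles of ε and minObj; the parameter values and data are hypothetical:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2)),
               rng.uniform(-2, 5, (5, 2))])      # two dense regions plus a few stray points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)       # eps = ε, min_samples = minObj

print(np.unique(db.labels_))   # label -1 explicitly designates outliers (noise points)
```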
Additional Approaches to Clustering
Model-Based Approach
• differs from other approaches in that it is based on a statistical model rather than an algorithm.
• uses differing probability distributions of objects as the basis for forming groups
rather than groupings of similarity in distance or high density.
• basic model – mixture model where objects are assumed to be represented by a
mixture of probability distributions (known as components), each representing a
different cluster.
• Advantages
• Can be applied to any combination of clustering variables (metric and/or nonmetric).
• Statistical tests are available to compare different models and determine the best-fitting model to
define the best cluster solution (one such comparison is sketched below).
• Missing data can be directly handled.
• No scaling issues or transformations of variables needed.
• Once the cluster solution is finalized, can include antecedent/predictor and
outcome/validation variables.
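A minimal sketch of the mixture-model idea using scikit-learn's GaussianMixture, with BIC as the statistical criterion for comparing candidate solutions; the data are simulated and only metric variables are shown:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])  # hypothetical data

# Fit mixture models with 1-5 components and compare fit via BIC
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gm.bic(X), 1))    # lower BIC indicates the better-fitting model

labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
```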
STAGE 5: INTERPRETATION OF THE CLUSTERS
Cluster Interpretation
• The cluster centroid should also be assessed for correspondence with the
researcher’s prior expectations based on theory or practical experience.
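One simple way to examine the cluster centroids is to average each clustering variable within each cluster; a sketch assuming cluster labels have already been obtained (the data and labels here are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical clustering variables and previously obtained cluster labels
df = pd.DataFrame(rng.normal(size=(90, 3)), columns=["var1", "var2", "var3"])
df["cluster"] = rng.integers(1, 4, size=90)

centroids = df.groupby("cluster").mean()   # cluster centroid on each clustering variable
print(centroids.round(2))                  # compare against prior theoretical expectations
```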
STAGE 6: VALIDATION AND PROFILING OF THE
CLUSTERS
Validation
Profiling
Validation of the Final Cluster Solution
Two approaches
• Cross-validation – empirically validates a cluster solution by creating two sub-
samples (randomly splitting the sample) and then comparing the two cluster
solutions for consistency with respect to number of clusters and the cluster
profiles.
• Criterion validity – achieved by examining differences on variables not included in
the cluster analysis but for which there is a theoretical and relevant reason to
expect variation across the clusters.
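A minimal sketch of the cross-validation idea: randomly split the sample, cluster each sub-sample, then check whether one sub-sample's solution reproduces the other's (scikit-learn, simulated data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(4, 1, (60, 3))])  # hypothetical data

# Randomly split the sample into two sub-samples
idx = rng.permutation(len(X))
A, B = X[idx[:60]], X[idx[60:]]

km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(A)
km_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(B)

# Classify sub-sample B with the centroids from sub-sample A and compare with
# B's own solution; high agreement suggests a stable, consistent cluster structure
labels_from_a = km_a.predict(B)
print(adjusted_rand_score(km_b.labels_, labels_from_a))
```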
Profiling A Cluster Solution
Clusters should differ on relevant dimensions not used in forming the clusters. This
typically involves the use of discriminant analysis or ANOVA.
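A minimal sketch of profiling via one-way ANOVA on a variable not used in the clustering (SciPy, with hypothetical labels and profile variable):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(9)
labels = rng.integers(1, 4, size=120)                # hypothetical cluster membership
profile_var = rng.normal(size=120) + labels * 0.5    # hypothetical profiling variable

groups = [profile_var[labels == k] for k in (1, 2, 3)]
f_stat, p_value = f_oneway(*groups)   # clusters should differ on relevant profile variables
print(round(f_stat, 2), round(p_value, 4))
```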
Rules of Thumb – Deriving the Final Cluster Solution
Primary advantage
• Simplification by reducing the large number of observations into a much smaller
number of groupings from which the general nature and character of the entire
dataset can be observed.
Challenges
• Increasing sample sizes pose difficulties for clustering methods, particularly
hierarchical methods.
• Clustering high-dimensional data creates difficulties in:
• establishing object similarity.
• ensuring variable relevancy.
HBAT CLUSTER ILLUSTRATIVE EXAMPLE
Steps in Cluster Analysis
Primary Objective
• develop a taxonomy that segments objects (HBAT customers) into groups with
similar perceptions.
• Once identified, strategies with different appeals can be formulated for the
separate groups—the requisite basis for market segmentation.
Clustering Variables
• surrogate variables based on the exploratory factor analysis (EFA) results (Chapter 3)
• X6 product quality (representative of the factor Product Value)
• X8 technical support (representative of the factor Technical Support)
• X12 salesforce image (representative of the factor Marketing)
• X15 new product development (not included in extracted factors)
• X18 delivery speed (representative of the factor Post-sale Customer Service)
Stages 2 and 3: Research Design and Assumptions in Cluster Analysis
• Part 1: Partitioning
• A hierarchical procedure was used to identify a preliminary set of cluster
solutions as a basis for determining the appropriate number of clusters.
The hierarchical and nonhierarchical procedures from IBM SPSS and SAS are used
in this analysis, and comparable results would be obtained with most other
clustering programs.
Part 1: A Hierarchical Analysis – Initial Analysis
Validation
• Stability – the four-group nonhierarchical cluster solution is examined by
comparing two different solutions using sample-generated seed points.
• Criterion validity – assess statistical significance of cluster differences on four
outcome measures of relevance to research objectives.
• MANOVA and individual ANOVAs all significant, indicating criterion validity.
Profiling
• Profiling of clusters on variables not included as clustering variables to assess
practical significance and theoretical basis of cluster solution.
• Five characteristics – X1 (Customer Type), X2 (Industry Type), X3 (Firm Size), X4
(Region), and X5 (Distribution System).
• Three characteristics (X1, X2, and X4) exhibited significant differences, thus providing
additional support for the practical relevance and theoretical basis of the solution.
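Profiling on categorical characteristics such as these is typically checked with a chi-square test of cluster membership against each characteristic; a sketch with hypothetical labels rather than the actual HBAT data:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(10)
# Hypothetical cluster memberships and a hypothetical categorical profile characteristic
cluster = rng.integers(1, 5, size=100)
characteristic = rng.choice(["type A", "type B", "type C"], size=100)

table = pd.crosstab(cluster, characteristic)    # clusters x categories contingency table
chi2, p, dof, _ = chi2_contingency(table)
print(round(chi2, 2), round(p, 4))   # a significant result indicates the clusters differ on this characteristic
```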
Cluster Analysis Learning Checkpoint