
Chapter 4:

Cluster Analysis
LEARNING OBJECTIVES

Upon completing this chapter, you should be able to do the following:


• Define cluster analysis, its roles and its limitations.
• Identify the types of research questions addressed by cluster analysis.
• Understand how interobject similarity is measured.
• Understand why different distance measures are sometimes used.
• Understand the differences between hierarchical and nonhierarchical
clustering techniques.
• Know how to interpret the results from cluster analysis.
• Follow the guidelines for cluster validation.
Overview

What is Cluster Analysis?


How Does Cluster Analysis Work?
Cluster Analysis Decision Process
• Stage 1: Objectives of Cluster Analysis
• Stage 2: Research Design in Cluster Analysis
• Stage 3: Assumptions in Cluster Analysis
• Stage 4: Deriving Clusters and Assessing Overall Fit
• Stage 5: Interpretation of the Clusters
• Stage 6: Validation and Profiling of the Clusters
Implications of Big Data Analytics
An Illustrative Example
WHAT IS CLUSTER ANALYSIS?

 Cluster Analysis as a Multivariate Technique


 Conceptual Development with Cluster Analysis
 Necessity of Conceptual Support in Cluster Analysis
Cluster Analysis Defined
Definition
• Groups objects (respondents, products, firms, variables, etc.) so that objects in the same cluster are similar to one another and different from objects in all other clusters.
Cluster Analysis as a Multivariate Technique
• Cluster variate is the set of clustering variables used to measure similarity
Conceptual Development with Cluster analysis
• Data reduction – reduces population to smaller number of homogeneous groups
• Hypothesis Generation – means of developing or assessing hypotheses
Necessity of Conceptual Support
• Strong conceptual support of existence of clusters helps negate criticisms:
• Cluster analysis is descriptive, atheoretical, and non-inferential
• Cluster analysis will always create clusters, regardless of the actual existence of any structure
• The cluster solution is not generalizable because it is totally dependent upon cluster variate
What is Cluster Analysis?

Cluster analysis . . . is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.

• It has been referred to as Q analysis, typology construction, classification analysis, and numerical taxonomy.

• The essence of all clustering approaches is the classification of data as suggested by “natural” groupings of the data themselves.
HOW DOES CLUSTER ANALYSIS WORK?

 A Simple Example
 Objective Versus Subjective Considerations
Scatter Diagram for Cluster Observations

Fundamental Question: How Many Clusters?

[Scatter plot of observations: frequency of eating out (vertical axis) versus frequency of going to fast food restaurants (horizontal axis).]

Potential Two, Three and Four Cluster Solutions

[The same scatter plot grouped into two-, three- and four-cluster solutions. Which one is correct?]

Three Basic Questions In A Cluster Analysis
How do we measure similarity?
• We require a method of simultaneously comparing observations on the clustering
variables. Several methods are possible, including the correlation between objects
or perhaps a measure of their proximity in two-dimensional space such that the
distance between observations indicates similarity.
How do we form clusters?
• No matter how similarity is measured, the procedure must group those
observations that are most similar into a cluster, thereby determining the cluster
group membership of each observation for each set of clusters formed.
How many groups do we form?
• The final task is to select one cluster solution (i.e., set of clusters) as the final
solution. In doing so, the researcher faces a trade-off: fewer clusters and less
homogeneity within clusters versus a larger number of clusters and more within-
group homogeneity.
Three Cluster Diagram Showing Between-Cluster and
Within-Cluster Variation

Between-Cluster Variation = Maximize


Within-Cluster Variation = Minimize
Objective Versus Subjective Considerations

Cluster analysis has come under criticism for two key elements:

1. Subjectivity in selecting the final solution
• While there are empirical diagnostic measures to assist the researcher in selecting the final solution(s), there is no method by which one solution is deemed optimal.
• Thus, it still falls to the researcher to make the final decision as to the number of clusters to accept as the final solution.

2. Judgment required of the researcher


• Judgment of the researcher in selecting the characteristics to be used, the methods of
combining clusters, and even the interpretation of cluster solutions makes any final solution
unique to that researcher.
CLUSTER ANALYSIS DECISION PROCESS

Stage 1: Objectives of Cluster Analysis


Stage 2: Research Design in Cluster Analysis
Stage 3: Assumptions in Cluster Analysis
Stage 4: Deriving Clusters and Assessing Overall Fit
Stage 5: Interpretation of the Clusters
Stage 6: Validation and Profiling of the Clusters
STAGE 1: OBJECTIVES OF CLUSTER ANALYSIS

 Research Questions in Cluster Analysis


 Selection of Clustering Variables
Primary Goal
• to partition a set of objects into two or more groups based on the
similarity of the objects for a set of specified characteristics (the cluster
variate).

Two key issues:


• The research questions being addressed, and
• The variables used to characterize objects in the clustering process.
Research Questions in Cluster Analysis

How to form the taxonomy


• creating an empirically based classification of objects.

How to simplify the data


• grouping observations for further analysis.

Which relationships can be identified


• revealing relationships among the observations within and between groups
Selection of Clustering Variables

Variable Selection Is Critical Decision


• Clustering variables represent the sole means of measuring similarity among objects
• As a result, the analysis is constrained based on the variables included.

Two Issues in Variable Selection . . .


1. Conceptual considerations
 Variables characterize the objects being clustered.
 Relate specifically to the objectives of the cluster analysis.

2. Practical considerations.
 Should always use the “best” variables available (i.e., little measurement error, etc.).
Rules of Thumb – Objectives of Cluster Analysis

• Cluster analysis is used for:


• Taxonomy description – identifying natural groups within the data.
• Data simplification – the ability to analyze groups of similar observations instead
of all individual observations.
• Relationship identification – the simplified structure from cluster analysis portrays
relationships not revealed otherwise.
• Theoretical, conceptual and practical considerations must be observed
when selecting clustering variables for cluster analysis:
• Only variables that relate specifically to objectives of the cluster analysis are
included, since “irrelevant” variables cannot be excluded from the analysis once it
begins.
• Variables are selected which characterize the individuals (objects) being clustered.
STAGE 2: RESEARCH DESIGN IN CLUSTER ANALYSIS

 Types and Number of Clustering Variables


 Sample size
 Outlier detection
 Measuring object similarity
 Data standardization
Types and Number of Clustering Variables

Types of Variables Included


• Can employ either metric or non-metric, but generally not in mixed fashion.
• Multiple measures of similarity for each type.

Number of Clustering Variables


• Can suffer from “curse of dimensionality” when large number of variables analyzed.
• Can have impact with as few as 20 variables.

Relevancy of Clustering Variables


• No internal method of ascertaining the relevancy of clustering variables.
• Researcher should always include only those variables with strongest conceptual
support.
Is the Sample Size Adequate?

The sample size required is not based on statistical considerations for inference testing, but rather:
• Sufficient size is needed to ensure representativeness of the population and its underlying structure.
• Of particular interest is the ability to detect small groups within the population.
• Minimum group sizes are based on the relevance of each group to the research question and the confidence needed in characterizing that group.

Increasing sample size (e.g., 1,000 observations), however, may pose problems for hierarchical clustering methods and require “hybrid” approaches.
Detecting Outliers
Outliers can severely distort the representativeness of the results if they appear as structure (clusters) that is inconsistent with the research objectives.
 They should be removed if the outlier represents:
 Aberrant observations not representative of the population.
 Observations of small or insignificant segments within the population.
 They should be retained if the outlier represents:
 an under-sampling/poor representation of relevant groups in the population. In this case, the
sample should be augmented to ensure representation of these groups.
Outliers can be identified based on the similarity measure by:
• Finding observations with large distances from all other observations – pairwise
similarities or summated measure of squared differences from mean of each
clustering variable.
• Graphic profile diagrams or parallel coordinate graphs highlighting outlying cases.
• Their appearance in cluster solutions as single-member or very small clusters.
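As a hedged illustration of the first identification approach, the sketch below (assuming a hypothetical array `X` of standardized clustering variables and an arbitrary two-standard-deviation cutoff) flags observations with unusually large average distances from all other observations.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # hypothetical: 100 objects, 5 clustering variables

D = squareform(pdist(X, metric="euclidean"))  # full pairwise-distance matrix
avg_dist = D.sum(axis=1) / (len(X) - 1)       # mean distance from each object to all others

# Flag objects with unusually large average distances (the cutoff is a judgment call).
cutoff = avg_dist.mean() + 2 * avg_dist.std()
outlier_candidates = np.flatnonzero(avg_dist > cutoff)
print(outlier_candidates)
```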
Defining and Measuring Interobject Similarity

Interobject similarity
• an empirical measure of correspondence, or resemblance, between objects to be
clustered.
• calculated across the entire set of clustering variables to allow for the grouping of
observations and their comparison to each other.

Three methods most widely used in applications of cluster analysis:


• Distance measures – most often used.
• Correlational measures – less often used as they measure patterns, not distance.
• Association measures – applicable for non-metric clustering variables.
Types of Distance Measures

• Most widely used measure of similarity, with higher values representing greater dissimilarity (distance between cases), not similarity.
• Many different distance measures, most common are:
 Euclidean (straight line) distance is the most common measure of distance.
 Squared Euclidean distance is the sum of squared distances and is the
recommended measure for the centroid and Ward’s methods of clustering.
 Mahalanobis distance (D²) accounts for variable intercorrelations and weights each
variable equally. When variables are highly intercorrelated, Mahalanobis distance
is most appropriate.
• Given the sensitivity of some procedures to the similarity measure used,
the researcher should employ several distance measures and compare the
results from each with other results or theoretical/known patterns.
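A minimal sketch of these three distance measures using scipy, on hypothetical data; the inverse covariance matrix `VI` supplied to the Mahalanobis measure is what adjusts for variable intercorrelations.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                 # hypothetical data

d_euclid = pdist(X, metric="euclidean")      # straight-line distance
d_sq = pdist(X, metric="sqeuclidean")        # squared Euclidean (centroid/Ward's methods)

VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
d_mahal = pdist(X, metric="mahalanobis", VI=VI)  # accounts for intercorrelations
```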
Data Standardization

Clustering variables should be standardized whenever possible to avoid problems resulting from the use of different scale values among clustering variables.

Two approaches to standardization


• Relative to other cases:
• most common standardization is Z scores.
• Relative to other responses within an object
• If groups are to be identified according to an individual’s response style, then
within-case or row-centering standardization is appropriate.
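A minimal sketch of both standardization approaches, assuming a hypothetical DataFrame `df` of metric clustering variables:

```python
import pandas as pd

df = pd.DataFrame({"x1": [7, 9, 3, 5],       # hypothetical clustering variables
                   "x2": [2, 8, 4, 6],
                   "x3": [9, 9, 1, 5]})

# Relative to other cases: column-wise Z scores.
z_scores = (df - df.mean()) / df.std()

# Relative to other responses within an object: within-case (row-centering)
# standardization, removing each respondent's overall response level.
row_centered = df.sub(df.mean(axis=1), axis=0)
```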
STAGE 3: ASSUMPTIONS OF CLUSTER ANALYSIS

 Structure Exists
 Representativeness of the sample.
 Impact of multicollinearity.
Three Assumptions Underlying Cluster Analysis

Structure Exists
• Since cluster analysis will always generate a solution, researcher must assume that a
“natural” structure of objects exists which is to be identified by the technique.
Representativeness of the Sample
• Must be confident that the obtained sample is truly representative of the population.
Impact of multicollinearity
• Multicollinearity among subsets of variables is an implicit “weighting” of the
clustering variables
• Potential remedies for multicollinear subsets of variables
• Reduce the variables to equal numbers in each set of correlated measures.
• Use a distance measure that compensates for the correlation, like Mahalanobis Distance.
• Take a proactive approach and include only cluster variables that are not highly correlated.
STAGE 4: DERIVING CLUSTERS
AND ASSESSING OVERALL FIT
 Selecting the partitioning procedure.
 Potentially respecify initial cluster solutions by eliminating outliers or small clusters.
 Determining the number of clusters.
 Other clustering approaches.
Two Approaches to Partitioning Observations

Hierarchical
• Most common approach: all objects start as separate clusters and are then joined sequentially, two clusters at a time, until only a single cluster remains.

Non-hierarchical
• The number of clusters is specified by the analyst and the set of objects is then formed into that number of groupings.
Two Types of Hierarchical Clustering Procedures

Agglomerative Methods
• Buildup: all observations start as individual clusters and are joined together sequentially.

Divisive Methods
• Breakdown: all observations start in a single cluster, which is then divided into smaller clusters.
How Do Agglomerative Hierarchical Approaches Work?

A multi-step process
• Start with all observations as their own cluster.
• Using the selected similarity measure and agglomerative algorithm,
combine the two most similar observations into a new cluster, now
containing two observations.
• Repeat the clustering procedure using the similarity
measure/agglomerative algorithm to combine the two most similar
observations or clusters (i.e., combinations of observations) into
another new cluster.
• Continue the process until all observations are in a single cluster.
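A minimal sketch of this process with scipy's hierarchical clustering routines, on hypothetical data; each row of the linkage matrix records one merge step of the agglomerative sequence.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                  # hypothetical data

Z = linkage(X, method="ward", metric="euclidean")  # full agglomerative merge sequence
print(Z[:5])  # each row: [cluster i, cluster j, merge distance, new cluster size]

labels = fcluster(Z, t=4, criterion="maxclust")    # cut the tree into four clusters
```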
Clustering Algorithms -- Agglomerative

Provides a method for combining clusters which have more than one
observation

Most widely used algorithms


• Single Linkage (nearest neighbor) – shortest distance from any object in one
cluster to any object in the other.
• Complete Linkage (farthest neighbor) – based on maximum distance between
observations in each cluster.
• Average Linkage – based on the average similarity of all individuals in a cluster.
• Centroid Method – measures distance between cluster centroids.
• Ward’s Method – based on the total sum of squares within clusters.
Single-Linkage Versus Complete Linkage

Single linkage: similarity based only on the two closest observations.
Complete linkage: similarity based only on the two farthest observations.
Comparing the Agglomerative Algorithms
Single-linkage
• probably the most versatile algorithm, but poorly delineated cluster structures
within the data produce unacceptable snakelike “chains” for clusters.
Complete linkage
• eliminates the chaining problem, but only considers the outermost observations in
a cluster, thus impacted by outliers.
Average linkage
• generates clusters with small within-cluster variation and less affected by outliers.
Centroid linkage
• like average linkage, is less affected by outliers.
Ward’s method
• most appropriate when the researcher expects somewhat equally sized clusters,
but easily distorted by outliers.
How Do Nonhierarchical Approaches Work?
1. Determine number of clusters to be extracted
2. Specify cluster seeds.
• Researcher specified.
• Sample generated:
• SAS FASTCLUS = first cluster seed is first observation in data set with no missing values.
• SPSS QUICK CLUSTER = seed points are selected randomly from all observations.
3. Assign each observation to one of the seeds based on similarity.
• Sequential Threshold = selects one seed point, develops cluster; then selects
next seed point and develops cluster, and so on. Observation cannot be re-
assigned to another cluster following its original assignment.
• Parallel Threshold = sets all seed points simultaneously, then develops clusters.
• Optimization = allow for re-assignment of observations based on the sequential
proximity of observations to clusters formed during the clustering process.
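A minimal sketch of a nonhierarchical run with scikit-learn's KMeans, on hypothetical data; `init` accepts either researcher-specified seed points (an array) or "random" for sample-generated seeds, broadly analogous to the options above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # hypothetical data

seeds = X[:4]                     # hypothetical researcher-specified seed points
km = KMeans(n_clusters=4, init=seeds, n_init=1, random_state=0).fit(X)
print(km.labels_[:10])            # cluster membership of the first ten observations
```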
Pros and Cons of Hierarchical Methods

Pros
• Simplicity – generates tree-like structure which is simplistic portrayal of process.
• Measures of similarity – multiple measures to address many situations.
• Speed – generate entire set of cluster solutions in single analysis.

Cons
• Permanent combinations – once joined, clusters are never separated.
• Impact of outliers – outliers may appear as single object or very small clusters.
• Large samples – not amenable to very large samples; may require clustering a sample drawn from the full dataset.
Pros and Cons of Nonhierarchical Methods

Pros
• Results are less susceptible to:
• outliers in the data,
• the distance measure used, and
• the inclusion of irrelevant or inappropriate variables.
• Can easily analyze very large data sets

Cons
• Best results require knowledge of seed points.
• Difficult to guarantee optimal solution.
• Typically generates only spherical and roughly equally sized clusters.
• Less efficient in examining a wide range of cluster solutions.
Selecting Between Hierarchical and Nonhierarchical

Hierarchical clustering solutions are preferred when:


• A wide range of alternative clustering solutions, even all of them, is to be examined.
• The sample size is moderate (under 300-400, not exceeding 1,000) or a
sample of the larger dataset is acceptable.

Nonhierarchical clustering methods are preferred when:


• The number of clusters is known and initial seed points can be specified
according to some practical, objective or theoretical basis.
• There is concern about outliers since nonhierarchical methods generally
are less susceptible to outliers.
Combining Hierarchical and Nonhierarchical Approaches

Combination approach – using a hierarchical approach followed by a nonhierarchical approach is often advisable, as sketched below.

1. A hierarchical approach is used to select the number of clusters and profile cluster centers that serve as initial cluster seeds in the nonhierarchical procedure.

2. A nonhierarchical method then clusters all observations using the seed points to provide more accurate cluster memberships.
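A minimal sketch of this combination approach, on hypothetical data: a Ward hierarchical pass suggests cluster centers, which then seed a K-means pass over all observations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))     # hypothetical data

Z = linkage(X, method="ward")     # Part 1: hierarchical pass
k = 4                             # number of clusters chosen via stopping rules
hier = fcluster(Z, t=k, criterion="maxclust")
centers = np.vstack([X[hier == c].mean(axis=0) for c in range(1, k + 1)])

# Part 2: nonhierarchical pass seeded with the hierarchical cluster centers.
km = KMeans(n_clusters=k, init=centers, n_init=1, random_state=0).fit(X)
```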
Should the Cluster Analysis Be Respecified?

Primary focus
• Identification of single object or very small clusters that represent disparate
observations that do not match the research objectives.
• Similar considerations to outlier identification and many times operate on the
same conditions in the sample.

If respecification occurs
• Should reanalyze remaining data, especially when using hierarchical procedures.
Determining the Number of Clusters

Stopping rules
• Criteria used with hierarchical techniques to identify potential cluster solutions.
• Foundational principle – a natural increase in heterogeneity comes from the
reduction in number of clusters.
• Common to all stopping rules:
• evaluating the trend in heterogeneity across cluster solutions to identify marked increases.
• substantive increases in this trend indicate relatively distinct clusters were joined and that the
cluster structure before joining is a potential candidate for the final solution.
• Issues in applying stopping rules
• The ad hoc procedures must be computed by the researcher and often involve fairly complex
approaches.
• Many times measures are software-specific.
Two Classes of Stopping Rules

Class 1: Measures of Heterogeneity Change


• measures heterogeneity change between cluster solutions at each successive
decrease in the number of clusters. A cluster solution is a candidate for the final
cluster solution when the heterogeneity change measure makes a sudden jump.
• Measures of heterogeneity change
• Percentage Changes in Heterogeneity – simple percentage change in heterogeneity.
• Measures of variance change – use of root mean square standard deviation (RMSSTD) to
compare solutions.
• Statistical measures of heterogeneity change – the pseudo T² statistic compares goodness-of-fit between k and k - 1 clusters. Thus, a large pseudo T² value at six clusters indicates a seven-cluster solution as the possible final solution.
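A minimal sketch of a percentage-change-in-heterogeneity rule, on hypothetical data; the third column of scipy's linkage matrix serves as the heterogeneity (agglomeration) coefficient at each merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # hypothetical data

Z = linkage(X, method="ward")
heights = Z[:, 2]                 # heterogeneity at each successive merge
pct_change = np.diff(heights) / heights[:-1] * 100

# Pair each k-cluster solution with the percentage jump caused by the next
# merge (k -> k-1); a large jump flags the k-cluster solution as a candidate.
for k, pct in zip(range(len(X) - 1, 1, -1), pct_change):
    if k <= 8:
        print(k, round(pct, 1))
```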
Class 2: Direct Measures of Heterogeneity
• directly measure heterogeneity of each cluster solution and then allow analyst to
evaluate each cluster solution against a criterion measure.
• Measures of heterogeneity:
• Comparative cluster heterogeneity – the cubic clustering criterion (CCC) is a SAS measure of
the deviation of the clusters from an expected multivariate normal distribution. Choose cluster
solution(s) with high values of CCC.
• Statistical significance of cluster variation – pseudo F statistic measures the separation among
all the clusters by the ratio of between-cluster variance (separation of clusters) to within-
cluster variance (homogeneity of clusters). Higher values indicate a possible cluster solution.
• Internal validation index – characterizes a cluster solution on two dimensions: separation and compactness. A common measure is the Dunn index, the ratio of the minimal between-cluster distance to the maximal within-cluster distance. Higher values indicate better solutions.
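A minimal sketch of the Dunn index as defined above, with hypothetical data and memberships:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Minimal distance between objects in different clusters (separation).
    min_between = min(cdist(a, b).min()
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    # Maximal distance between objects in the same cluster (compactness).
    max_within = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_between / max_within

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))                  # hypothetical data
labels = rng.integers(0, 3, size=60)          # hypothetical memberships
print(dunn_index(X, labels))
```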
Rules of Thumb – Deriving The Final Cluster Solution
No single objective procedure is available to determine the correct number of
clusters; rather the researcher can evaluate alternative cluster solutions on
two general types of stopping rules:
• Measures of heterogeneity change:
• These measures, whether they be percentage changes in heterogeneity, measures of variance
change (RMSSTD) or statistical measure of change (pseudo T²), all evaluate the change in
heterogeneity when moving from k to k - 1 clusters.
• Candidates for a final cluster solution are those cluster solutions which preceded a large increase
in heterogeneity by joining two clusters (i.e., a large change in heterogeneity going from k to k –
1 clusters would indicate that the k cluster solution is better).
• Direct measures of heterogeneity:
• These measures directly reflect the compactness and separation of a specific cluster solution.
These measures are compared across a range of cluster solutions, with the cluster solution(s)
exhibiting more compactness and separation being preferred.
• Among the most prevalent measures are the CCC (cubic clustering criterion), a statistical
measure of cluster variation (pseudo F statistic) or the internal validation index (Dunn's index).
Additional Approaches to Clustering

Density-based approach
• Fundamental principle – clusters can be identified by “dense” clusters of objects
within the sample, separated by regions of lower object density.
• Researcher must decide
• ε, the radius around a point that defines a point’s neighborhood, and
• the minimum number of objects (minObj) necessary within a neighborhood to define it a cluster.
• Has advantages of
• Ability to identify clusters of any arbitrary shape
• Ability to process very large samples,
• Requires specification of only two parameters,
• No prior knowledge of number of clusters,
• Explicit designation of outliers as separate from objects assigned to clusters,
• Applicable to a “mixed” set of clustering variables (i.e., both metric and nonmetric).
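A minimal sketch of density-based clustering with scikit-learn's DBSCAN, on hypothetical two-group data; `eps` plays the role of the neighborhood radius ε and `min_samples` the role of minObj.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # hypothetical dense region 1
               rng.normal(3, 0.3, size=(50, 2))])  # hypothetical dense region 2

db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # eps = radius, min_samples = minObj
print(set(db.labels_))                      # label -1 designates outliers ("noise")
```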
Additional Approaches to Clustering
Model-Based Approach
• varies from other approaches in that it is a statistical model versus algorithmic.
• uses differing probability distributions of objects as the basis for forming groups
rather than groupings of similarity in distance or high density.
• basic model – mixture model where objects are assumed to be represented by a
mixture of probability distributions (known as components), each representing a
different cluster.
• Advantages
• Can be applied to any combination of clustering variables (metric and/or nonmetric).
• Statistical tests are available to compare different models and determine best model fit to
define best cluster solution.
• Missing data can be directly handled.
• No scaling issues or transformations of variables needed.
• Once the cluster solution is finalized, can include antecedent/predictor and
outcome/validation variables.
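A minimal sketch of model-based clustering for the metric-variable case, using scikit-learn's Gaussian mixture on hypothetical data; an information criterion (here BIC) compares models with differing numbers of components to select the best-fitting solution.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 3)),    # hypothetical component 1
               rng.normal(4, 1, size=(100, 3))])   # hypothetical component 2

# Fit mixtures with 1-5 components; lower BIC indicates better model fit.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k, bics[best_k])
```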
STAGE 5: INTERPRETATION OF THE CLUSTERS
Cluster Interpretation

• Involves examining each cluster in terms of the cluster variate to name or assign a label accurately describing the nature of the cluster.

• The cluster centroid, a mean profile of the cluster on each clustering variable, is particularly useful in the interpretation stage.
• Interpretation involves examining the distinguishing characteristics of each
cluster’s profile and identifying substantial differences between clusters.
• Cluster solutions failing to show substantial variation indicate other cluster
solutions should be examined.

• The cluster centroid should also be assessed for correspondence with the
researcher’s prior expectations based on theory or practical experience.
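A minimal sketch of computing cluster centroids for interpretation, assuming a hypothetical DataFrame of clustering variables and a `labels` array from any of the clustering runs above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x6", "x8", "x12"])  # hypothetical
labels = rng.integers(1, 5, size=100)        # hypothetical cluster memberships

centroids = df.groupby(labels).mean()        # one mean profile per cluster
print(centroids.round(2))
```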
STAGE 6: VALIDATION AND PROFILING OF THE
CLUSTERS
 Validation
 Profiling
Validation of the Final Cluster Solution

Validation is essential in cluster analysis since the clusters are descriptive of structure and require additional support for their relevance.

Two approaches
• Cross-validation – empirically validates a cluster solution by creating two sub-
samples (randomly splitting the sample) and then comparing the two cluster
solutions for consistency with respect to number of clusters and the cluster
profiles.
• Criterion validity – achieved by examining differences on variables not included in
the cluster analysis but for which there is a theoretical and relevant reason to
expect variation across the clusters.
Profiling A Cluster Solution

Describing the characteristics of each cluster on a set of additional variables (not the clustering variables) to further understand the differences between clusters.
• Examples include descriptive variables (e.g., demographics) as well as other
outcome-related measures.
• Provides insight to researchers as to nature and character of the clusters.

Clusters should differ on these relevant dimensions. This typically involves the
use of discriminant analysis or ANOVA.
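A minimal sketch of profiling via one-way ANOVA with scipy, assuming a hypothetical profiling variable `y` and cluster memberships `labels`:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
y = rng.normal(size=100)                     # hypothetical profiling variable
labels = rng.integers(1, 5, size=100)        # cluster memberships from Stage 4

groups = [y[labels == c] for c in np.unique(labels)]
F, p = f_oneway(*groups)
print(f"F = {F:.2f}, p = {p:.3f}")           # a small p suggests clusters differ on y
```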
Rules of Thumb – Deriving the Final Cluster Solution

There is no single objective procedure to determine the ‘correct’ number of clusters. Rather, the researcher must evaluate alternative cluster solutions on the following considerations to select the “best” solution:
• Single-member or extremely small clusters are generally not acceptable and
should generally be eliminated.
• For hierarchical methods, several ad hoc stopping rules are available to indicate
the number of clusters based on the rate of change in a total similarity measure as
the number of clusters increases or decreases or measures of heterogeneity.
• All clusters should be significantly different across the set of clustering variables.
• Cluster solutions ultimately must have theoretical validity assessed through
external validation.
IMPLICATIONS OF BIG DATA ANALYTICS
Implications for Big Data Analytics

Primary advantage
• Simplification by reducing the large number of observations into a much smaller
number of groupings from which the general nature and character of the entire
dataset can be observed.

Challenges
• Increasing sample sizes pose difficulties for clustering methods, particularly
hierarchical methods.
• Clustering high-dimensional data creates difficulties in:
• establishing object similarity.
• ensuring variable relevancy.
HBAT CLUSTER ILLUSTRATIVE EXAMPLE
Steps in Cluster Analysis

1. Select the variables.
2. Decide how many clusters to use.
3. Describe the characteristics of the derived clusters using demographics, psychographics, etc.
Stage 1: Objectives of Cluster Analysis

Primary Objective
• develop a taxonomy that segments objects (HBAT customers) into groups with
similar perceptions.
• Once identified, strategies with different appeals can be formulated for the
separate groups—the requisite basis for market segmentation.

Clustering Variables
• surrogate variables based on the EFA results (Chapter 3)
• X6 product quality (representative of the factor Product Value)
• X8 technical support (representative of the factor Technical Support)
• X12 salesforce image (representative of the factor Marketing)
• X15 new product development (not included in extracted factors)
• X18 delivery speed (representative of the factor Post-sale Customer Service)
Stages 2 and 3: Research Design and Assumptions in Cluster Analysis

Stage 2: Research Design of Cluster Analysis


• Detecting Outliers
• Two observations – 6 and 87 – are candidates for outlier designation
• Defining Similarity
• Squared Euclidean distance chosen given that multicollinearity is minimal
• Sample Size
• Sample of 100 is sufficient to distinguish segments of relevant size (minimum 10%)
• Standardization
• No standardization employed since all variables on same response scale
Stage 3: Assumptions of Cluster Analysis
• Sample Representativeness
• Random sampling in data collection ensured sample representativeness
• Multicollinearity
• Minimized by using surrogate variables for factors found in EFA (see Chapter 3)
Stages 4-6: Hierarchical and Nonhierarchical Methods
A Two-Part Procedure

• Part 1: Partitioning
• A hierarchical procedure was used to identify a preliminary set of cluster
solutions as a basis for determining the appropriate number of clusters.

• Part 2: Fine Tuning


• Use of a nonhierarchical procedure to “fine-tune” the results and then profile and validate the final cluster solution.

The hierarchical and nonhierarchical procedures from IBM SPSS and SAS are used
in this analysis, and comparable results would be obtained with most other
clustering programs.
Part 1: A Hierarchical Analysis – Initial Analysis

Initial Cluster Results


• Two potential outliers
identified earlier (cases 6
and 87) do not form a cluster
until stage 75 and then join
again at stage 89.
• As a result, these two cases
are designated as outliers
and omitted from the
subsequent analysis.
• The hierarchical method is
performed again with 98
observations included.
Part 1: A Hierarchical Analysis – Respecified Analysis
Various Stopping Rules:
• Proportionate Change in Heterogeneity –
four cluster solution seems most distinct,
but three and five cluster solutions also
possible
• Pseudo T² – values indicate 4 or 5 clusters
• CCC – does not identify any distinct solution
• Pseudo F – little variation, possibly five
cluster solution
• Result – four cluster solution
Stage 5: Profiling the Hierarchical Cluster Results
Four Cluster Solution
• Statistically significant differences across
clusters for all five clustering variables.
• Profiles
• Cluster 1 – 49 observations with relatively lower
mean on X15 (New Products) than the other three
clusters.
• Cluster 2 – 18 observations and is best
characterized by two variables: a very low mean
on X8 (Technical Support) and highest score on X15
(New Products).
• Cluster 3 – 14 observations and is best
characterized by a relatively low score on X6
(product quality).
• Cluster 4 – 17 observations and characterized by
a relatively low score on X12 (Salesforce Image).
Part 2: A Nonhierarchical Cluster Analysis

The results of Part 1 (Hierarchical Analysis) are used to determine the number of clusters – 4.

The nonhierarchical method is used to develop an “optimal” four-cluster solution.

The resulting four-cluster solution will be validated on a series of related outcome measures.
Stage 4: Deriving Clusters and Assessing Overall Fit

Two factors must be determined:

• How will seed points for clusters be determined?
• Employed a sample-generating process, specifically randomly selected observations to act as seed points.

• What clustering algorithm will be used?
• K-means optimizing algorithm.
• allows for reassignment of observations among clusters until a minimum level of heterogeneity is
reached. Using this algorithm, observations are initially grouped to the closest cluster seed. When all
observations are assigned, each observation is evaluated to see if it is still in the closest cluster. If it is
not, it is reassigned to a closer cluster. The process continues until the homogeneity within clusters
cannot be increased by further movement (reassignment) of observations between clusters.
Stage 5: Profiling the Nonhierarchical Cluster Results
Four Cluster Solution
• Statistically significant differences across
clusters for four of the five clustering
variables
• Profiles
• Cluster 1 – 25 observations and most distinguished
by a relatively low mean for new products (X15).
Means for the other variables (except product
quality) also are relatively low.
• Cluster 2 – distinguished by relatively higher means
on technical support (X8) and product quality (X6).
• Cluster 3 – 17 observations and is distinguished by a
relatively higher mean for new products (X15) and
relatively lower mean for technical support (X8)
• Cluster 4 – 27 observations and distinguished by the
lowest mean of all clusters on product quality (X6)
and relatively average on other variables
Stage 6: Validation and Profiling the Clusters

Validation
• Stability – the four-group nonhierarchical cluster solution is examined by
comparing two different solutions using sample-generated seed points.
• Criterion validity – assess statistical significance of cluster differences on four
outcome measures of relevance to research objectives.
• MANOVA and individual ANOVAs all significant, indicating criterion validity.
Profiling
• Profiling of clusters on variables not included as clustering variables to assess
practical significance and theoretical basis of cluster solution.
• Five characteristics – X1 (Customer Type), X2 (Industry Type), X3 (Firm Size), X4
(Region), and X5 (Distribution System).
• Three characteristics (X1, X2 and X4) exhibited significant differences, thus providing additional support to the practical relevance and theoretical basis.
Cluster Analysis Learning Checkpoint

1. Why might we use cluster analysis?


2. What are the three major steps in cluster analysis?
3. How do you decide how many clusters to extract?
4. Why do we validate clusters?
