Cluster Analysis
An Introduction
This document is an introduction to Cluster Analysis. The objectives:
1. Understanding the underlying concepts
2. Appreciating when to use the technique
3. What Cluster Analysis can and cannot do
4. The process of building a cluster solution
5. Evaluating a cluster solution
6. Implementing a cluster solution
Agenda
Introduction:
1. Why Segmentation?
2. Types of Segmentation:
   a) Objective Segmentation (CHAID)
   b) Subjective Segmentation (Cluster Analysis)
3. What is Cluster Analysis?
4. Basic Concepts
SAS Procedures:
1. Proc Factor
2. Proc Standard
3. Proc Cluster / Proc Fastclus
Step A: Why Segmentation?
Step B: Types of Segmentation:
1. Objective Segmentation (CHAID)
2. Subjective Segmentation (Cluster Analysis)
Step C: What is Cluster Analysis?
Step D: Basic Concepts
Each individual is so different that ideally we would want to reach out to each one of them in a different way
Solution: Identify segments of people who share the same characteristics, and target each of these segments in a different way.
Segmentation provides a catalyst for creative insights and is often the first step in marketing strategy planning. Segmentation is a multipurpose technique.
Segmentation provides a common vocabulary for communicating marketing analysis. Segmentation can complement models.
[Figure: Segmentation splits into Objective (CHAID) and Subjective (Cluster Analysis)]
Cluster Analysis is a technique for combining observations into groups such that the groups are as dissimilar as possible from one another and the observations within each group are as similar as possible. In other words, Cluster Analysis means dividing the whole population into groups that are distinct from each other but internally similar.
[Figure: Total Population divided into Group 1, Group 2, Group 3 and Group 4]
The objects in Group 1 should be as similar to one another as possible, but an object in Group 1 should differ markedly from an object in Group 2.
The attributes of the objects are allowed to determine which objects should be grouped together.
Avg. delinquency age = 0 days; avg. age = 35 yrs; avg. utilization > 80%
Avg. delinquency age = 15 days; avg. age = 33 yrs; avg. utilization = 60%
Avg. delinquency age = 12 days; avg. age = 25 yrs; avg. utilization = 90%
Avg. delinquency age = 75 days; avg. age = 50 yrs; avg. utilization = 40%
We can exclude the group with avg. delinquency age = 75 days from the mailing. This type of segmentation is known as Subjective Segmentation. It gives the salient characteristics of the best customers.
Step D: Basic Concepts of Cluster Analysis Using Two Variables
[Figure: Cluster 1 and Cluster 2 plotted by Income and Current Balance]
Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share similar characteristics (high Income and low Balance), while the objects in Cluster 2 share the opposite characteristics (high Balance and low Income).
An object in Cluster 1 therefore differs considerably from an object in Cluster 2.
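The idea of "similar within, different between" can be sketched numerically. The example below is only an illustration in Python, not part of the SAS workflow; the (income, balance) points are hypothetical and assumed to be on a common 0-10 scale.

```python
import math
from itertools import combinations, product

# Hypothetical (income, current balance) pairs on a common 0-10 scale.
cluster_1 = [(8.0, 2.0), (9.0, 1.5), (8.5, 2.5)]  # high income, low balance
cluster_2 = [(2.0, 8.0), (1.5, 9.0), (2.5, 8.5)]  # low income, high balance

def mean_distance(pairs):
    """Average Euclidean distance over an iterable of (point, point) pairs."""
    distances = [math.dist(p, q) for p, q in pairs]
    return sum(distances) / len(distances)

# Distances between objects of the same cluster.
within_1 = mean_distance(combinations(cluster_1, 2))
within_2 = mean_distance(combinations(cluster_2, 2))
# Distances between an object of Cluster 1 and an object of Cluster 2.
between = mean_distance(product(cluster_1, cluster_2))

print(f"within Cluster 1: {within_1:.2f}")
print(f"within Cluster 2: {within_2:.2f}")
print(f"between clusters: {between:.2f}")
```

A good solution shows small within-cluster distances relative to the between-cluster distance.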
Cluster Sizes
The Cluster Analysis performed on 75,134 Logo 02 accounts resulted in 6 segments.
Segment Sizes
Segment 1 20%
Segment 2 29%
Significant Variables
The variables used for generating the segments are:
1. Account age (in days)
3. Time (in mths.) elapsed between last transaction and end of window (Recency)
4. Sum of all transaction amounts over last 6 months
Profile variables: Account age (days); Total amt. of sale transactions - last 6 months (Bht); Recency (months); No. of times revolved last 3 months
Segment 3: New Hopefuls (23%) Newest accounts. Low Balance, Credit Utilisation, Fees and Fin. Charges. Low Revolvers. Low transactors in number and total amount transacted. But, sale per transaction slightly higher than average. Therefore, hold some hope for the future. Delinquency higher than average.
Segment 4: Affluent Spenders (5%) Highest monthly income. High Balance; but low Fees and Fin. Charges. Heavy transactors.
Segments: Credit hungry poor; New Revolvers; New Hopefuls; Affluent Spenders; Vintage fuddy-duddies; Old Risky Inactives
Note: Refer to the Behaviour Profile and Demographic Profile 1 in the Appendix for profile details.
[Figure: segments plotted by AVT (value per txn. in Bht)]
Affluent Spenders: 5% of portfolio, 25% of all txn. value
New Revolvers: 28% of portfolio, 28% of all txn. value
Vintage Fuddy-Duddies: 15% of portfolio, 14% of all txn. value
New Hopefuls: 23% of portfolio, 19% of all txn. value
[Figure: % of Portfolio and % of Revolvers by segment (Credit hungry poor, Affluent Spenders, New Revolvers, New Hopefuls, Old Risky Inactives, Vintage fuddy-duddies)]
Clustering Process
Step 1: Data Cleaning and Preparing the data set for analysis
Step 2: Creating new relevant Variables
Step 3: Selection of Variables
Step 4: Tackling the Outliers
Step 5: Treatment of Missing Values
Step 6: Multicollinearity Check and hence reducing dimensions
Step 7: Standardization of the selected variables
Step 8: Getting Cluster Solution
Step 9: Checking the optimality of the solution
[Figure: data preparation flow: Server -> Client Data (Different Tables) -> Merged Data -> Data Cleaning -> Final Data, Ready for Analysis]
Variable Types:
1. Demographic
2. Socio-Economic
3. Product Related
4. Behavioral
Variable Creation
New relevant variables are to be created from the existing ones where necessary. For example, for an auto loan portfolio with variables such as deposit amount and price of the vehicle under finance, a new variable Deposit Percent = (Deposit Amount / Price of the Vehicle) * 100 can be created.
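The Deposit Percent formula above can be sketched as follows. This is a Python illustration only; the field names `deposit_amount` and `vehicle_price` and the sample accounts are hypothetical, and returning None for a zero or missing price is an assumption about how such cases might be handled.

```python
def deposit_percent(deposit_amount, vehicle_price):
    """Deposit as a percentage of the vehicle price.

    Returns None when either input is missing or the price is zero
    (an assumed convention, so that the ratio is never undefined).
    """
    if deposit_amount is None or vehicle_price is None or vehicle_price == 0:
        return None
    return deposit_amount / vehicle_price * 100

# Hypothetical auto loan accounts.
accounts = [
    {"deposit_amount": 2000.0, "vehicle_price": 10000.0},
    {"deposit_amount": 1500.0, "vehicle_price": 6000.0},
    {"deposit_amount": 500.0, "vehicle_price": 0.0},
]

for account in accounts:
    account["deposit_percent"] = deposit_percent(
        account["deposit_amount"], account["vehicle_price"]
    )
    print(account)
```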
1. There is no limit to the number of variables to be selected for analysis.
2. Selection of variables depends on the purpose of clustering.
3. Irrelevant variables are to be dropped.
4. Variables with a large % of missing values are to be dropped.
E.g. for an auto loan portfolio, variables such as Loan Amount, Month on Book, Term, Deposit Amount and Car Price should be considered. We should not look at APR (Annual Percentage Rate), because APR depends mainly on the overall yearly performance of the business rather than on the accounts.
E.g. suppose the purpose of clustering is to identify highly delinquent people. In this case Maximum Delinquency reached will have more significance than month-end delinquency variables.
Step 4: Tackling Outliers
What is an outlier? An observation is said to be an outlier w.r.t. a variable if it is far away from the remaining observations.
[Figure: scatter plot of Var 2 against Var 1 with one point marked as an outlier]
To identify them:
1. Univariate and frequency analysis
2. Histogram and box plot
To tackle them:
1. The outliers can be deleted from the analysis if they are very small in number.
2. The variables selected can be trimmed or capped.
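Capping can be sketched as below. This is a Python illustration under stated assumptions: the 10th/90th percentile cut-offs, the nearest-rank percentile definition and the `balances` data are all hypothetical choices, not values prescribed by this document.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 0-100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, kept at 1 or above.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def cap(values, lower_pct=1, upper_pct=99):
    """Pull values below/above the chosen percentiles back to those percentiles."""
    low = percentile(values, lower_pct)
    high = percentile(values, upper_pct)
    return [min(max(v, low), high) for v in values]

# Hypothetical variable with one extreme value (250).
balances = [12, 15, 14, 13, 16, 15, 14, 13, 12, 250]
capped = cap(balances, lower_pct=10, upper_pct=90)
print(capped)
```

Capping keeps the observation in the data set, whereas deletion removes it; which is preferable depends on how many outliers there are.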
Step 5: Treatment of Missing Values
Variables with many missing values (about 15%) should not be used for clustering unless a missing value has a special significance and can be replaced by some meaningful number, e.g. insurance variables.
Note: SAS does not include observations with missing values in the clustering process.
% of Missing: Less than 1%; 1-5%; 5-10%
Treatments: Delete those observations; Mean Imputation
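Mean imputation can be sketched as follows. This is a Python illustration only; missing values are represented by None, and the `term` data are hypothetical.

```python
def missing_share(values):
    """Fraction of entries that are missing (represented here by None)."""
    return sum(v is None for v in values) / len(values)

def impute_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical loan terms in months, with one missing value out of 20.
term = [36, 48, 60, 48] * 5
term[2] = None

share = missing_share(term)
imputed = impute_mean(term)
print(f"missing share: {share:.0%}")
print(f"imputed value: {imputed[2]:.2f}")
```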
What is Multicollinearity? A set of independent (explanatory) variables is said to exhibit multicollinearity if there is a linear relationship among them.
Factor Analysis: Use Factor Analysis to select the factors that together explain about 90-95% of the total variation, then select the variables that have high loadings on those factors.
VIF (Variance Inflation Factor): Variables with a VIF of more than 2 should be dropped.
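The VIF of a variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on the other explanatory variables. With only two variables, R^2 reduces to the squared correlation, which allows a very small Python sketch. The two-variable restriction and the sample data are assumptions made for illustration; a real check would regress each variable on all the others.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_variables(x, y):
    """VIF when there are exactly two explanatory variables: 1 / (1 - r^2)."""
    r = pearson(x, y)
    return 1 / (1 - r ** 2)

# Hypothetical data: car_price moves almost in step with amt_fin; age does not.
amt_fin = [10, 12, 15, 18, 20, 25]
car_price = [12, 15, 18, 22, 24, 31]
age = [45, 23, 37, 52, 29, 41]

vif_price = vif_two_variables(amt_fin, car_price)
vif_age = vif_two_variables(amt_fin, age)
print(f"VIF (amt_fin, car_price): {vif_price:.1f}")
print(f"VIF (amt_fin, age): {vif_age:.2f}")
```

Under the rule above, one of `amt_fin` and `car_price` would be dropped, while `age` would be kept.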
Step 7: Standardization
Why do we need Standardization? Since the units of measurement differ from variable to variable, standardization is a must.
E.g. consider two variables, Age and Income. The unit of Age is years and the unit of Income is, say, dollars. Hence they are not comparable, and without standardization there would be no common unit of measurement for the distance between two clusters.
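Standardization to mean 0 and standard deviation 1 can be sketched as below. This is a Python illustration, not the Proc Standard call itself; the `age` and `income` values are hypothetical, and the sample standard deviation (n - 1 divisor) is an assumed choice.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, n - 1 divisor
    return [(v - mean) / std for v in values]

# Hypothetical data in very different units.
age = [25, 35, 45, 55]                  # years
income = [20000, 40000, 60000, 80000]   # dollars

age_z = standardize(age)
income_z = standardize(income)
print([round(z, 3) for z in age_z])
print([round(z, 3) for z in income_z])
```

After standardization both variables are unit-free, so a distance computed from them no longer depends on whether Income is recorded in dollars or thousands of dollars.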
Cluster Process: In SAS the two most commonly used procedures are Proc Cluster and Proc Fastclus.
Proc Cluster: Single Linkage, Complete Linkage, Two Stage, etc.
Proc Fastclus: K-Means
What is K-Means: The process starts with K distinct observations that are as far apart from each other as possible. Each of the remaining observations is then considered one by one and assigned to the nearest cluster. If, in the course of this, two clusters come significantly close to each other, they are merged to form a new cluster.
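A textbook K-Means loop can be sketched as follows. This is a Python illustration of the general technique, not the exact algorithm inside Proc Fastclus: the farthest-point seeding is an assumed way of choosing distant starting observations, the merging of close clusters described above is not implemented, and the sample points are hypothetical.

```python
import math

def farthest_seeds(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def k_means(points, k, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing centroids, until the assignments stop changing."""
    centroids = farthest_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical standardized observations forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```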
Cluster Process: After cleaning the data set of outliers, either of the above procedures can be used to build clusters. There is no hard and fast rule for the number of clusters or their sizes, but as a rule of thumb each cluster should contain at least 5% of the observations and the total number of clusters should be between 5 and 15. Some of the variables present in the data set are to be used for clustering. These variables must be numeric. They may be continuous or discrete, but if discrete, the categories must be ordinal. The goodness of a particular set of clusters is measured by the extent to which the means of the clustering variables differ from one cluster to another.
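The rule of thumb on cluster sizes and cluster count can be checked mechanically. The sketch below is a Python illustration; the function name `check_rule_of_thumb` and the sample cluster labels are hypothetical, and the thresholds simply restate the 5% and 5-to-15 figures.

```python
from collections import Counter

def check_rule_of_thumb(labels, min_share=0.05, min_clusters=5, max_clusters=15):
    """Return each cluster's share, the clusters below min_share, and whether
    the number of clusters lies within [min_clusters, max_clusters]."""
    counts = Counter(labels)
    shares = {cluster: n / len(labels) for cluster, n in counts.items()}
    too_small = sorted(c for c, share in shares.items() if share < min_share)
    count_ok = min_clusters <= len(counts) <= max_clusters
    return shares, too_small, count_ok

# Hypothetical cluster assignments for 100 observations.
labels = [1] * 40 + [2] * 30 + [3] * 15 + [4] * 12 + [5] * 3
shares, too_small, count_ok = check_rule_of_thumb(labels)
print(shares)
print("clusters below 5%:", too_small)
print("cluster count within 5 to 15:", count_ok)
```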
Profiling the Clusters: After the clusters are built, they are to be profiled with respect to discrete and continuous variables to identify the distinguishing features of each cluster.
Statistics for checking a cluster solution (meaning and ideal value):
1. Between Variation / Total Variation. Ideal value: >= 0.3
2. Avg(Var R Square); 1 ... Avg[WithinVariance(Var1), WithinVariance(Var2), ...]. Ideal value: >= 0.6
3. Similar to above, different formula; calculated assuming variables are independent. Ideal value: close to Overall R Square (diff <= 0.1)
4. RMS STD: for each cluster, Sqrt{Avg[Variance(Var1), Variance(Var2), ...]}; the "dispersion" within each cluster. Ideal value: <= 1.1
5. How close or how far apart the cluster centroids are. Ideal value: >= 1.5
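Two of these statistics can be sketched directly from their definitions: the R Square of a variable as between variation over total variation, and RMS STD as the square root of the average within-cluster variance. This is a Python illustration on hypothetical standardized data; the figures SAS reports may be computed with different divisors.

```python
import math
import statistics

def r_square(values, labels):
    """Between-cluster variation / total variation for one variable,
    computed as 1 - within variation / total variation."""
    grand_mean = statistics.mean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for cluster in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == cluster]
        cluster_mean = statistics.mean(members)
        within_ss += sum((v - cluster_mean) ** 2 for v in members)
    return 1 - within_ss / total_ss

def rms_std(rows):
    """Sqrt of the average per-variable variance over the rows of one cluster."""
    variances = [statistics.variance(column) for column in zip(*rows)]
    return math.sqrt(sum(variances) / len(variances))

# Hypothetical standardized variables and cluster labels.
var1 = [-1.0, -0.8, -1.2, 1.0, 0.8, 1.2]
var2 = [0.9, 1.1, 1.0, -1.0, -0.9, -1.1]
labels = [1, 1, 1, 2, 2, 2]

r2_var1 = r_square(var1, labels)
r2_var2 = r_square(var2, labels)
overall_r2 = (r2_var1 + r2_var2) / 2  # Avg(Var R Square)
cluster_1_rows = [(a, b) for a, b, lab in zip(var1, var2, labels) if lab == 1]
rms_1 = rms_std(cluster_1_rows)
print(f"R Square var1: {r2_var1:.3f}, var2: {r2_var2:.3f}, overall: {overall_r2:.3f}")
print(f"RMS STD of cluster 1: {rms_1:.3f}")
```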
/* K-means clustering; out2 receives each observation's cluster assignment */
proc fastclus data=out1 out=out2 maxc=120 maxiter=100 delete=1200 short;
    var amt_fin term dep_per age mon_book;
run;
Scoring
Minimum Euclidean Distance Method
[Figure: scatter plot of Var 2 against Var 1 showing a new observation to be scored]
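Scoring by minimum Euclidean distance can be sketched as follows. This is a Python illustration; the centroid coordinates and the new observation are hypothetical, and the sketch assumes the new observation is expressed on the same scale as the centroids.

```python
import math

def score(observation, centroids):
    """Return the name of the cluster whose centroid has the minimum
    Euclidean distance to the observation."""
    return min(centroids, key=lambda name: math.dist(observation, centroids[name]))

# Hypothetical cluster centroids in (Var 1, Var 2) space.
centroids = {
    "Cluster 1": (10.0, 20.0),
    "Cluster 2": (35.0, 60.0),
}
new_observation = (30.0, 50.0)

assigned = score(new_observation, centroids)
print(f"New observation assigned to {assigned}")
```

If the clusters were built on standardized variables, the new observation would first need to be standardized with the same means and standard deviations.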
Thank You