Cluster Analysis
Introduction:
Cluster Analysis is an exploratory data analysis tool used to organize observed
data into subgroups, or clusters, that are easier to manage than individual data
points. The groups or clusters are defined through the data analysis itself and
require no prior knowledge of which item belongs to which group or cluster. This
is done by maximizing the similarity between elements within a group and
maximizing the dissimilarity between different groups. Geometrically, objects within
a cluster will be close together, and different clusters will be far apart. There are
various ways in which clusters can be formed. SPSS has 3 different methods that
can be used to perform cluster analysis: hierarchical cluster analysis, k-means
cluster and two-step cluster.
Flowchart
Example:
The dataset used in the example is shown below.
Name      Weight.kilos  Height.cms
Beefy     11.31         33.79
Benny     9.34          34.38
Bertie    10.79         40.86
Biffy     11.04         37.07
Billy     9.74          33.77
Champ     2.94          22.98
Charger   2.99          16.21
Charlie   2.66          22.38
Chewy     2.32          19.68
Chechee   2.82          20.11
Chico     2.34          18.78
Chief     3.12          20.92
Laddy     29.57         61.69
Larry     29.64         59.03
Lassie    28.59         62.98
Lemmy     33.03         60.69
Loco      32.83         60.26
LouLou    31.23         61.34
It has the Name, Weight in kilos and Height in centimetres of 18 dogs. Our aim is to
categorize the dogs into different groups based on the above-mentioned fields.
STEP1: Choose the Classify option in the Analyze menu, then choose Hierarchical
Cluster.... Move Weight.kilos and Height.cms into the Variable(s) list.
STEP2: Click the Statistics option, check the options Agglomeration schedule
and Proximity matrix, and click Continue.
STEP3: Click the Plots option, check Dendrogram, then click Continue.
STEP4: Click Method and choose Ward's method in the Cluster Method
dropdown menu, then choose Squared Euclidean distance after checking the
Interval option in the Measure category. Click Continue.
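For readers who want to reproduce the analysis outside SPSS, the following is a
minimal Python sketch using scipy; it is an illustration under our own variable
names, not part of the SPSS procedure. It builds the same squared Euclidean
proximity matrix and Ward linkage tree requested in the steps above, then cuts
the tree into 3 clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

names = ["Beefy", "Benny", "Bertie", "Biffy", "Billy", "Champ",
         "Charger", "Charlie", "Chewy", "Chechee", "Chico", "Chief",
         "Laddy", "Larry", "Lassie", "Lemmy", "Loco", "LouLou"]
# Columns: Weight.kilos, Height.cms (copied from the table above)
X = np.array([[11.31, 33.79], [9.34, 34.38], [10.79, 40.86],
              [11.04, 37.07], [9.74, 33.77], [2.94, 22.98],
              [2.99, 16.21], [2.66, 22.38], [2.32, 19.68],
              [2.82, 20.11], [2.34, 18.78], [3.12, 20.92],
              [29.57, 61.69], [29.64, 59.03], [28.59, 62.98],
              [33.03, 60.69], [32.83, 60.26], [31.23, 61.34]])

# Proximity matrix of squared Euclidean dissimilarities (STEP4's measure);
# the Lemmy-Loco entry comes out to .225, the smallest off-diagonal value.
prox = squareform(pdist(X, metric="sqeuclidean"))

# Ward linkage built from the raw observations, then cut into 3 clusters.
Z = linkage(X, method="ward")
for name, label in zip(names, fcluster(Z, t=3, criterion="maxclust")):
    print(name, label)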
Case Processing Summary(a)
Cases:  Valid: N = 18, Percent = 100.0   Missing: N = 0, Percent = .0   Total: N = 18, Percent = 100.0
a. Ward Linkage
Proximity Matrix
Case        1:Beefy   2:Benny   3:Bertie  4:Biffy   5:Billy   6:Champ   7:Charger
1:Beefy     .000      4.229     50.255    10.831    2.465     186.913   378.279
2:Benny     4.229     .000      44.093    10.126    .532      170.920   370.471
3:Bertie    50.255    44.093    .000      14.427    51.371    381.317   668.463
4:Biffy     10.831    10.126    14.427    .000      12.580    264.138   499.942
5:Billy     2.465     .532      51.371    12.580    .000      162.664   353.916
6:Champ     186.913   170.920   381.317   264.138   162.664   .000      45.835
7:Charger   378.279   370.471   668.463   499.942   353.916   45.835    .000
8:Charlie   205.011   188.622   407.607   286.021   179.859   .438      38.178
9:Chewy     279.912   265.370   520.333   378.451   253.585   11.274    12.490
10:Chechee  259.223   246.143   494.083   355.210   234.482   8.251     15.239
11:Chico    305.761   292.360   558.929   410.214   279.460   18.000    7.027
12:Chief    232.713   219.860   456.433   323.549   208.947   4.276     22.201
13:Laddy    1111.838  1155.089  786.577   949.505   1172.755  2207.621  2774.927
14:Larry    973.047   1019.713  685.471   828.202   1034.078  2012.492  2543.775
15:Lassie   1150.654  1188.522  806.134   979.331   1208.547  2257.922  2842.793
16:Lemmy    1195.368  1253.432  887.847   1041.465  1267.110  2327.452  2880.872
17:Loco     1163.771  1221.554  862.122   1012.580  1234.868  2283.211  2830.828
18:LouLou   1155.809  1206.014  837.224   996.669   1221.925  2271.814  2834.215

Proximity Matrix (continued)
Case        8:Charlie  9:Chewy   10:Chechee  11:Chico  12:Chief  13:Laddy  14:Larry
1:Beefy     205.011    279.912   259.223     305.761   232.713   1111.838  973.047
2:Benny     188.622    265.370   246.143     292.360   219.860   1155.089  1019.713
3:Bertie    407.607    520.333   494.083     558.929   456.433   786.577   685.471
4:Biffy     286.021    378.451   355.210     410.214   323.549   949.505   828.202
5:Billy     179.859    253.585   234.482     279.460   208.947   1172.755  1034.078
6:Champ     .438       11.274    8.251       18.000    4.276     2207.621  2012.492
7:Charger   38.178     12.490    15.239      7.027     22.201    2774.927  2543.775
8:Charlie   .000       7.406     5.179       13.062    2.343     2269.424  2071.143
9:Chewy     7.406      .000      .435        .810      2.178     2507.403  2294.805
10:Chechee  5.179      .435      .000        1.999     .746      2444.459  2234.079
11:Chico    13.062     .810      1.999       .000      5.188     2582.741  2365.353
12:Chief    2.343      2.178     .746        5.188     .000      2361.795  2155.683
13:Laddy    2269.424   2507.403  2444.459    2582.741  2361.795  .000      7.081
14:Larry    2071.143   2294.805  2234.079    2365.353  2155.683  7.081     .000
15:Lassie   2320.725   2565.003  2501.930    2642.702  2417.764  2.625     16.705
16:Lemmy    2389.993   2624.924  2559.380    2698.324  2476.261  12.972    14.248
17:Loco     2345.123   2577.596  2512.623    2650.230  2430.320  12.673    11.689
18:LouLou   2334.127   2571.344  2507.041    2645.986  2423.949  2.878     7.864

Proximity Matrix (continued)
Case        15:Lassie  16:Lemmy  17:Loco   18:LouLou
1:Beefy     1150.654   1195.368  1163.771  1155.809
2:Benny     1188.522   1253.432  1221.554  1206.014
3:Bertie    806.134    887.847   862.122   837.224
4:Biffy     979.331    1041.465  1012.580  996.669
5:Billy     1208.547   1267.110  1234.868  1221.925
6:Champ     2257.922   2327.452  2283.211  2271.814
7:Charger   2842.793   2880.872  2830.828  2834.215
8:Charlie   2320.725   2389.993  2345.123  2334.127
9:Chewy     2565.003   2624.924  2577.596  2571.344
10:Chechee  2501.930   2559.380  2512.623  2507.041
11:Chico    2642.702   2698.324  2650.230  2645.986
12:Chief    2417.764   2476.261  2430.320  2423.949
13:Laddy    2.625      12.972    12.673    2.878
14:Larry    16.705     14.248    11.689    7.864
15:Lassie   .000       24.958    25.376    9.659
16:Lemmy    24.958     .000      .225      3.663
17:Loco     25.376     .225      .000      3.726
18:LouLou   9.659      3.663     3.726     .000
NOTE: The proximity matrix shown above gives the dissimilarity between the different
objects in the dataset: the smaller the value, the greater the similarity. For
example, 16:Lemmy and 17:Loco have a dissimilarity value of .225, which is the
lowest in the entire matrix, indicating that they are the most similar objects in the
entire dataset and hence should belong to the same cluster.
Ward Linkage
Agglomeration Schedule
Stage  Cluster Combined      Coefficients  Stage Cluster First Appears  Next Stage
       Cluster 1  Cluster 2                Cluster 1  Cluster 2
1      16         17         .112          0          0                 9
2      9          10         .330          0          0                 5
3      6          8          .549          0          0                 12
4      2          5          .815          0          0                 8
5      9          11         1.679         2          0                 7
6      13         15         2.991         0          0                 11
7      9          12         4.749         5          0                 12
8      1          2          6.892         0          4                 15
9      16         18         9.317         1          0                 13
10     3          4          16.531        0          0                 15
11     13         14         24.022        6          0                 13
12     6          9          34.561        3          7                 14
13     13         16         49.276        11         9                 17
14     6          7          67.473        12         0                 16
15     1          3          98.032        8          10                16
16     1          6          1001.275      15         14                17
17     1          13         8167.574      16         13                0
NOTE: The Agglomeration Schedule tells us which objects were added to which cluster and
at which stage. As we saw in the proximity matrix, 16:Lemmy and 17:Loco had the
lowest dissimilarity value (.225) in the entire matrix, indicating that they are the most
similar objects in the dataset. Therefore they were the first two objects to be joined
into a cluster, at stage 1. Unlike the proximity matrix, which gives the dissimilarity
between objects, the Agglomeration Schedule tells us the order in which objects were
merged into clusters.
NOTE: The above diagram is called an icicle plot. It is a graphical representation of the
agglomeration schedule.
NOTE: The above Dendrogram shows which objects belong to which cluster. It can be
seen from above that objects 16, 17, 18, 13, 15, 14 form a cluster; 6, 8, 9, 10, 11, 12, 7
form a cluster; and 2, 5, 1, 3, 4 form a cluster. Also, the last two of these clusters are
more similar to each other than either is to the first.
2) K-Means Clustering
K-means clustering, unlike hierarchical clustering, requires the number of
clusters (k) to be specified beforehand. Objects are then classified as
belonging to one of the k groups based on their distances from the cluster
centres, or centroids: each object is added to the cluster for which its
object-centroid distance is minimum.
Algorithm
STEP1: The number of clusters, k, is specified.
STEP2: The centroid position (initial cluster centre) of each of the k clusters
is calculated.
STEP3: The distance of each object from each centroid is calculated.
STEP4: Each object is added to the cluster for which the distance calculated in
STEP3 is minimum.
STEP5: STEP2 to STEP4 are repeated until the computed centroids stabilize or the
maximum number of iterations is reached.
A sketch of this loop in code is given after the flowchart.
Flowchart (let k be the number of clusters)
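The loop above is often called Lloyd's algorithm. The following NumPy sketch
(our own illustration with hypothetical names, not SPSS code) implements it
step for step:

import numpy as np

def kmeans(X, k, max_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # STEP2: initial cluster centres: k distinct cases chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # STEP3: distance of every object from every centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # STEP4: each object joins the cluster whose centre is nearest
        labels = dists.argmin(axis=1)
        # STEP5: recompute the centres and stop once they no longer move
        # (an empty cluster keeps its previous centre)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers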
Example:
The dataset used in the example is given below
Case  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
1     5.1           3.5          1.4           0.2
2     4.9           3.0          1.4           0.2
3     4.7           3.2          1.3           0.2
4     4.6           3.1          1.5           0.2
5     5.0           3.6          1.4           0.2
6     5.4           3.9          1.7           0.4
7     4.6           3.4          1.4           0.3
8     5.0           3.4          1.5           0.2
9     4.4           2.9          1.4           0.2
10    4.9           3.1          1.5           0.1
11    5.4           3.7          1.5           0.2
12    4.8           3.4          1.6           0.2
13    4.8           3.0          1.4           0.1
14    4.3           3.0          1.1           0.1
15    5.8           4.0          1.2           0.2
16    5.7           4.4          1.5           0.4
17    5.4           3.9          1.3           0.4
18    5.1           3.5          1.4           0.3
19    5.7           3.8          1.7           0.3
20    5.1           3.8          1.5           0.3
21    5.4           3.4          1.7           0.2
22    5.1           3.7          1.5           0.4
23    4.6           3.6          1.0           0.2
24    5.1           3.3          1.7           0.5
25    4.8           3.4          1.9           0.2
26    5.0           3.0          1.6           0.2
27    5.0           3.4          1.6           0.4
28    5.2           3.5          1.5           0.2
29    5.2           3.4          1.4           0.2
30    4.7           3.2          1.6           0.2
31    4.8           3.1          1.6           0.2
32    5.4           3.4          1.5           0.4
33    5.2           4.1          1.5           0.1
34    5.5           4.2          1.4           0.2
35    4.9           3.1          1.5           0.2
36    5.0           3.2          1.2           0.2
37    5.5           3.5          1.3           0.2
38    4.9           3.6          1.4           0.1
39    4.4           3.0          1.3           0.2
40    5.1           3.4          1.5           0.2
41    5.0           3.5          1.3           0.3
42    4.5           2.3          1.3           0.3
43    4.4           3.2          1.3           0.2
44    5.0           3.5          1.6           0.6
45    5.1           3.8          1.9           0.4
46    4.8           3.0          1.4           0.3
47    5.1           3.8          1.6           0.2
48    4.6           3.2          1.4           0.2
49    5.3           3.7          1.5           0.2
50    5.0           3.3          1.4           0.2
51    7.0           3.2          4.7           1.4
52    6.4           3.2          4.5           1.5
53    6.9           3.1          4.9           1.5
54    5.5           2.3          4.0           1.3
55    6.5           2.8          4.6           1.5
56    5.7           2.8          4.5           1.3
57    6.3           3.3          4.7           1.6
58    4.9           2.4          3.3           1.0
59    6.6           2.9          4.6           1.3
60    5.2           2.7          3.9           1.4
61    5.0           2.0          3.5           1.0
62    5.9           3.0          4.2           1.5
63    6.0           2.2          4.0           1.0
64    6.1           2.9          4.7           1.4
65    5.6           2.9          3.6           1.3
66    6.7           3.1          4.4           1.4
67    5.6           3.0          4.5           1.5
68    5.8           2.7          4.1           1.0
69    6.2           2.2          4.5           1.5
70    5.6           2.5          3.9           1.1
71    5.9           3.2          4.8           1.8
72    6.1           2.8          4.0           1.3
73    6.3           2.5          4.9           1.5
74    6.1           2.8          4.7           1.2
75    6.4           2.9          4.3           1.3
76    6.6           3.0          4.4           1.4
77    6.8           2.8          4.8           1.4
78    6.7           3.0          5.0           1.7
79    6.0           2.9          4.5           1.5
80    5.7           2.6          3.5           1.0
81    5.5           2.4          3.8           1.1
82    5.5           2.4          3.7           1.0
83    5.8           2.7          3.9           1.2
84    6.0           2.7          5.1           1.6
85    5.4           3.0          4.5           1.5
86    6.0           3.4          4.5           1.6
87    6.7           3.1          4.7           1.5
88    6.3           2.3          4.4           1.3
89    5.6           3.0          4.1           1.3
90    5.5           2.5          4.0           1.3
91    5.5           2.6          4.4           1.2
92    6.1           3.0          4.6           1.4
93    5.8           2.6          4.0           1.2
94    5.0           2.3          3.3           1.0
95    5.6           2.7          4.2           1.3
96    5.7           3.0          4.2           1.2
97    5.7           2.9          4.2           1.3
98    6.2           2.9          4.3           1.3
99    5.1           2.5          3.0           1.1
100   5.7           2.8          4.1           1.3
101   6.3           3.3          6.0           2.5
102   5.8           2.7          5.1           1.9
103   7.1           3.0          5.9           2.1
104   6.3           2.9          5.6           1.8
105   6.5           3.0          5.8           2.2
106   7.6           3.0          6.6           2.1
107   4.9           2.5          4.5           1.7
108   7.3           2.9          6.3           1.8
109   6.7           2.5          5.8           1.8
110   7.2           3.6          6.1           2.5
111   6.5           3.2          5.1           2.0
112   6.4           2.7          5.3           1.9
113   6.8           3.0          5.5           2.1
114   5.7           2.5          5.0           2.0
115   5.8           2.8          5.1           2.4
116   6.4           3.2          5.3           2.3
117   6.5           3.0          5.5           1.8
118   7.7           3.8          6.7           2.2
119   7.7           2.6          6.9           2.3
120   6.0           2.2          5.0           1.5
121   6.9           3.2          5.7           2.3
122   5.6           2.8          4.9           2.0
123   7.7           2.8          6.7           2.0
124   6.3           2.7          4.9           1.8
125   6.7           3.3          5.7           2.1
126   7.2           3.2          6.0           1.8
127   6.2           2.8          4.8           1.8
128   6.1           3.0          4.9           1.8
129   6.4           2.8          5.6           2.1
130   7.2           3.0          5.8           1.6
131   7.4           2.8          6.1           1.9
132   7.9           3.8          6.4           2.0
133   6.4           2.8          5.6           2.2
134   6.3           2.8          5.1           1.5
135   6.1           2.6          5.6           1.4
136   7.7           3.0          6.1           2.3
137   6.3           3.4          5.6           2.4
138   6.4           3.1          5.5           1.8
139   6.0           3.0          4.8           1.8
140   6.9           3.1          5.4           2.1
141   6.7           3.1          5.6           2.4
142   6.9           3.1          5.1           2.3
143   5.8           2.7          5.1           1.9
144   6.8           3.2          5.9           2.3
145   6.7           3.3          5.7           2.5
146   6.7           3.0          5.2           2.3
147   6.3           2.5          5.0           1.9
148   6.5           3.0          5.2           2.0
149   6.2           3.4          5.4           2.3
150   5.9           3.0          5.1           1.8
The table gives us the physical attributes of a sample of 150 flowers. Our aim
is to categorize them into 3 categories of flowers based on the above-mentioned
physical attributes.
STEP1: Choose the Classify option in the Analyze menu, then choose K-Means
Cluster.... A window, as shown below, opens. Enter 3 in the Number of
Clusters field.
STEP2: Click the Iterate option and enter 10 in the Maximum Iterations field
(10 is chosen for the purpose of demonstration; any number can be chosen),
then click Continue.
STEP3: Click the Save option, check the options Cluster membership and
Distance from cluster center, and click Continue.
Doing so will create 2 new fields in the original data file: Cluster membership
(which tells us which cluster each object belongs to) and Distance from
cluster center.
STEP5: Click Options, check the options Initial cluster centers and
Cluster information for each case, and click Continue.
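The measurements in the table above are the classic iris data, which
scikit-learn ships with, so the same analysis can be sketched in Python. This
is an illustration, not the SPSS procedure, and the cluster numbering may
differ from SPSS's:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

# k = 3 clusters and at most 10 iterations, as entered in the dialogs above
km = KMeans(n_clusters=3, max_iter=10, n_init=10, random_state=0).fit(X)

membership = km.labels_                  # the saved "Cluster membership" field
distance = km.transform(X).min(axis=1)   # "Distance from cluster center"
print(km.cluster_centers_)               # final cluster centres
print(np.bincount(membership))           # number of cases in each cluster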
Initial Cluster Centers
              Cluster
              1     2     3
Sepal.Length  7.7   5.7   4.9
Sepal.Width   3.8   4.4   2.5
Petal.Length  6.7   1.5   4.5
Petal.Width   2.2   .4    1.7
Iteration History
           Change in Cluster Centers
Iteration  1      2      3
1          1.226  1.205  1.141
2          .175   .000   .121
3          .070   .000   .047
4          .050   .000   .033
5          .000   .000   .000
NOTE: The table below is the required result. For each case, the output lists the
cluster it belongs to and its distance from that cluster's centre. For example, Case
Numbers 1, 2, 3, etc. belong to cluster 2, and Case Number 53 belongs to cluster 1.
Cluster Membership
Case  Distance    Case  Distance    Case  Distance
1     .141        51    1.227       101   .777
2     .448        52    .684        102   .854
3     .417        53    1.019       103   .306
4     .525        54    .732        104   .653
5     .189        55    .639        105   .385
6     .677        56    .269        106   1.142
7     .415        57    .765        107   1.071
8     .066        58    1.584       108   .786
9     .807        59    .756        109   .655
10    .376        60    .860        110   .844
11    .482        61    1.536       111   .746
12    .254        62    .324        112   .753
13    .501        63    .808        113   .260
14    .913        64    .397        114   .889
15    1.014       65    .873        115   1.202
16    1.205       66    .873        116   .683
17    .654        67    .412        117   .510
18    .144        68    .536        118   1.478
19    .824        69    .637        119   1.530
20    .389        70    .713        120   .826
21    .463        71    .709        121   .270
22    .329        72    .463        122   .819
23    .640        73    .694        123   1.311
24    .383        74    .437        124   .743
25    .487        75    .546        125   .276
26    .452        76    .743        126   .528
27    .209        77    .988        127   .625
28    .215        78    .846        128   .702
29    .211        79    .220        129   .546
30    .408        80    1.024       130   .594
31    .414        81    .864        131   .731
32    .426        82    .976        132   1.438
33    .716        83    .558        133   .561
34    .920        84    .734        134   .815
35    .350        85    .575        135   1.121
36    .350        86    .688        136   .953
37    .527        87    .927        137   .733
38    .257        88    .615        138   .579
39    .761        89    .508        139   .610
40    .115        90    .629        140   .348
41    .185        91    .488        141   .389
42    1.248       92    .383        142   .684
43    .669        93    .492        143   .854
44    .387        94    1.549       144   .310
45    .602        95    .386        145   .509
46    .482        96    .443        146   .612
47    .410        97    .345        147   .897
48    .472        98    .372        148   .653
49    .405        99    1.661       149   .836
50    .150        100   .384        150   .835
Final Cluster Centers
              Cluster
              1     2     3
Sepal.Length  6.9   5.0   5.9
Sepal.Width   3.1   3.4   2.7
Petal.Length  5.7   1.5   4.4
Petal.Width   2.1   .2    1.4

Distances between Final Cluster Centers
Cluster  1      2      3
1               5.018  1.797
2        5.018         3.357
3        1.797  3.357

Number of Cases in each Cluster
Cluster 1   38.000
Cluster 2   50.000
Cluster 3   62.000
Valid       150.000
Missing     .000
3) Two-Step Cluster
For datasets that are very large, or when a fast clustering procedure is required
that can form clusters on the basis of continuous (like age, salary, etc.) or
categorical (like gender, marital status, etc.) data, neither of the above-mentioned
cluster analysis procedures fulfils the requirement: a matrix of distances between
all pairs of cases is required for Hierarchical Cluster Analysis, whereas K-Means
clustering involves shuffling cases in and out of clusters and requires the number
of clusters to be known in advance. Two-Step Cluster Analysis is designed to fulfil
both requirements. It requires only one pass over the data (which is important for
very large data files), and it can produce solutions based on mixtures of continuous
and categorical variables and for varying numbers of clusters.
Algorithm
STEP1: Preclusters are formed; these are small clusters of the original cases
that are used in place of the raw data in the subsequent hierarchical clustering.
STEP2: Hierarchical clustering of the preclusters formed in STEP1 is carried
out.
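SPSS's implementation builds a CF-tree in its first pass and uses a
log-likelihood distance that accommodates categorical fields; the sketch below
(our own illustration with hypothetical names) only mimics the two-step
structure, using ordinary k-means for the preclustering and Ward agglomeration
for the second step:

from sklearn.cluster import AgglomerativeClustering, KMeans

def two_step(X, n_preclusters=20, n_clusters=3, seed=0):
    # STEP1: form many small preclusters (a stand-in for the CF-tree pass)
    pre = KMeans(n_clusters=n_preclusters, n_init=10, random_state=seed).fit(X)
    # STEP2: hierarchical (Ward) clustering of the precluster centres
    agg = AgglomerativeClustering(n_clusters=n_clusters)
    centre_labels = agg.fit_predict(pre.cluster_centers_)
    # every case inherits the final cluster of its precluster
    return centre_labels[pre.labels_]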
Example:
The dataset used in the example is given below
Name      age   mem_span  iq      read_ab  Gender
Oscar     4.00  4.20      101.00  5.60     1.00
Susie     4.00  4.20      101.00  5.60     2.00
Kimberly  4.10  3.90      108.00  5.00     2.00
Louise    5.50  4.20      90.00   5.80     2.00
Ronald    5.50  4.20      90.00   5.80     1.00
Charlie   5.50  4.10      105.00  6.00     1.00
Gertrude  5.70  3.60      88.00   5.30     2.00
Beatrice  5.90  4.00      90.00   6.00     2.00
Queenie   5.90  4.00      90.00   6.00     2.00
Thomas    5.90  4.00      90.00   6.00     1.00
Harry     6.15  5.00      95.00   6.40     1.00
Daisy     6.20  4.80      98.00   6.60     2.00
Ethel     6.40  5.00      106.00  7.00     2.00
Angus     6.70  4.40      95.00   7.20     1.00
Morris    6.90  4.50      91.00   6.60     2.00
John      6.90  5.00      104.00  7.30     1.00
Noel      7.20  5.00      92.00   6.80     1.00
Fred      7.30  5.50      100.00  7.20     1.00
Peter     7.30  5.50      100.00  7.20     1.00
Ian       7.50  5.40      96.00   6.60     1.00
The dataset has the fields age, memory span (mem_span), IQ, reading ability
(read_ab) and gender of 20 students. Our aim is to categorize them into 3 clusters
based on age, mem_span, read_ab and gender, and to evaluate the 3 clusters on
their IQ values.
STEP1: Choose the Classify option in the Analyze menu, then choose TwoStep
Cluster.... Add Gender to the Categorical Variables list and age, mem_span and
read_ab to the Continuous Variables list. Under Number of Clusters, choose
Specify fixed and enter 3.
STEP2: Click the Output option. Add iq to the Evaluation Fields (since we want
to evaluate the formed clusters on the basis of IQ). Check the option Charts and
tables in Model Viewer. Click Continue.
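The same mini-analysis can be sketched in Python, reusing the two_step()
illustration from the Algorithm section. This is a rough, hypothetical sketch,
not what SPSS does internally: the 1/2 gender code is simply standardized along
with the continuous fields (SPSS instead treats it as categorical in its
log-likelihood distance), and "students.csv" is an assumed file holding the
table above:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("students.csv")  # hypothetical file with the columns above
features = df[["age", "mem_span", "read_ab", "Gender"]]
X = StandardScaler().fit_transform(features)

# only 20 cases, so a handful of preclusters is plenty
df["cluster"] = two_step(X, n_preclusters=5, n_clusters=3)
print(df.groupby("cluster")["iq"].mean())  # evaluate the clusters on IQ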