K-Means Cluster Analysis UC Business Analytics R Programming Guide
tl;dr
This tutorial serves as an introduction to the k-means
clustering method.
Replication Requirements
To replicate this tutorial’s analysis you will need to load
the following packages:
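The package list itself is not reproduced here. Judging only from the functions called later in this tutorial (`%>%`, `as_tibble()`, `mutate()`, `group_by()`, `fviz_gap_stat()`, `library(gridExtra)`), a setup along the following lines is a plausible assumption, not the author's confirmed list. The `cluster` entry in particular is a guess based on the silhouette and gap statistic sections.

```r
# Assumed package set, inferred from functions used later in the tutorial
pkgs <- c("tidyverse",   # %>%, as_tibble(), mutate(), group_by(), summarise_all()
          "cluster",     # assumed: clustering helpers such as silhouette()
          "factoextra",  # fviz_gap_stat() and related cluster visualizations
          "gridExtra")   # arranging several plots side by side

# Check which of these are installed without attaching anything yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Anything reported as not installed can be added with `install.packages()` before rerunning the block.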
https://uc-r.github.io/kmeans_clustering 1/19
8/6/2021 K-means Cluster Analysis · UC Business Analytics R Programming Guide
Data Preparation
To perform a cluster analysis in R, generally, the data
should be prepared as follows:
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)
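As a quick sanity check on the preparation step, one can confirm that no missing values remain and that `scale()` leaves every column with mean 0 and standard deviation 1. A minimal sketch using only base R (the names `col_means` and `col_sds` are illustrative):

```r
df <- USArrests        # built-in data set: 50 states, 4 numeric variables
df <- na.omit(df)      # drop any rows with missing values
df <- scale(df)        # center each column and divide by its standard deviation

col_means <- colMeans(df)        # should be numerically zero
col_sds   <- apply(df, 2, sd)    # should be one

round(col_means, 10)
round(col_sds, 10)
anyNA(df)
```

Standardizing matters here because the distance measures used for clustering would otherwise be dominated by the variable with the largest scale.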
Euclidean distance:
$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$
Manhattan distance:
$$d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{2}$$
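Both distances are available through base R's `dist()`. The sketch below computes them for the first three states of the scaled data and repeats the calculation by hand for one pair, following the Euclidean and Manhattan definitions. The variable names are illustrative.

```r
df <- scale(na.omit(USArrests))
first_three <- df[1:3, ]

# Pairwise distance objects for three observations
d_euc <- dist(first_three, method = "euclidean")
d_man <- dist(first_three, method = "manhattan")

# The same two quantities written out for the first pair of states
x <- first_three[1, ]
y <- first_three[2, ]
euc_by_hand <- sqrt(sum((x - y)^2))   # square root of summed squared differences
man_by_hand <- sum(abs(x - y))        # sum of absolute differences

d_euc
d_man
```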
Where $x'_i = rank(x_i)$ and $y'_i = rank(y_i)$.
$$d_{kend}(x, y) = 1 - \frac{n_c - n_d}{\frac{1}{2}n(n - 1)} \tag{5}$$
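To make Eq. 5 concrete, the following sketch counts concordant and discordant pairs directly for two short, made-up vectors and compares the result with `1 - cor(x, y, method = "kendall")`. The two should agree when the data contain no ties; with ties, the built-in correlation applies a correction that Eq. 5 does not.

```r
# Two small illustrative vectors without ties (values are made up)
x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(2.0, 3.1, 1.5, 4.8, 5.2)
n <- length(x)

# Every unordered pair of observations (i, j)
pairs <- combn(n, 2)
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) * sign(y[pairs[1, ]] - y[pairs[2, ]])

n_c <- sum(s > 0)   # concordant pairs: x and y move in the same direction
n_d <- sum(s < 0)   # discordant pairs: x and y move in opposite directions

d_kend         <- 1 - (n_c - n_d) / (0.5 * n * (n - 1))
d_kend_builtin <- 1 - cor(x, y, method = "kendall")

c(by_hand = d_kend, builtin = d_kend_builtin)
```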
K-Means Clustering
K-means clustering is the most commonly used
unsupervised machine learning algorithm for partitioning
a given data set into a set of k groups (i.e. k clusters),
where k represents the number of groups pre-specified by
the analyst. It classifies objects into multiple groups (i.e.,
clusters), such that objects within the same cluster are as
similar as possible (i.e., high intra-class similarity),
whereas objects from different clusters are as dissimilar as
possible (i.e., low inter-class similarity). In k-means
clustering, each cluster is represented by its center (i.e.,
centroid), which corresponds to the mean of the points
assigned to the cluster.
$$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2 \tag{6}$$
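where $x_i$ is a data point belonging to the cluster $C_k$ and
$\mu_k$ is the mean of the points assigned to $C_k$.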
$$tot.withinss = \sum_{k=1}^{k} W(C_k) = \sum_{k=1}^{k} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \tag{7}$$
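The within-cluster quantities can be checked against what `kmeans()` reports. The sketch below fits an arbitrary 3-cluster solution (the choice of 3 and of `nstart = 25` is only for illustration), recomputes each cluster's sum of squared distances to its centroid by hand, and compares the results with the `withinss` and `tot.withinss` components of the fitted object.

```r
set.seed(123)
df  <- scale(na.omit(USArrests))
fit <- kmeans(df, centers = 3, nstart = 25)

# W(C_k): squared Euclidean distance of each member to its own centroid, summed
w_by_hand <- sapply(seq_len(nrow(fit$centers)), function(k) {
  members <- df[fit$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[k, ])^2)
})

# Per-cluster values and their total, compared with the kmeans() output
all.equal(w_by_hand, fit$withinss)
all.equal(sum(w_by_hand), fit$tot.withinss)
```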
K-means Algorithm
The first step when using k-means clustering is to indicate
the number of clusters (k) that will be generated in the
final solution. The algorithm starts by randomly selecting
k objects from the data set to serve as the initial centers
for the clusters. The selected objects are also known as
cluster means or centroids. Next, each of the remaining
objects is assigned to its closest centroid, where closest is
defined using the Euclidean distance (Eq. 1) between the
object and the cluster mean. This step is called the
“cluster assignment step”. After the assignment step, the
algorithm computes the new mean value of each cluster.
The term “centroid update” is used to designate this
step. Now that the centers have been recalculated, every
observation is checked again to see if it might be closer to
a different cluster. All the objects are reassigned
using the updated cluster means. The cluster assignment
and centroid update steps are repeated until the cluster
assignments stop changing (i.e., until convergence is
achieved). That is, the clusters formed in the current
iteration are the same as those obtained in the
previous iteration.
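The steps just described can be sketched in a few lines of base R. This is a plain assign-and-update loop written for illustration only: the function name `simple_kmeans` and its arguments are hypothetical, and it is not the routine behind R's `kmeans()`, which defaults to the Hartigan-Wong algorithm.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Randomly select k observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Cluster assignment step: squared Euclidean distance to every centroid
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                 numeric(nrow(x)))
    new_cluster <- max.col(-d2, ties.method = "first")  # index of nearest centroid

    # Convergence: assignments are the same as in the previous iteration
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster

    # Centroid update step: each center becomes the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(123)
df  <- scale(na.omit(USArrests))
fit <- simple_kmeans(df, k = 2)
table(fit$cluster)
```

Because the starting centroids are random, different seeds can lead to different final partitions, which is why running several random starts and keeping the best one is common practice.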
str(k2)
## List of 9
## $ iter : int 1
## $ ifault : int 0
k2
## Cluster means:
##
## Clustering vector:
##
##
## Available components:
##
df %>%
as_tibble() %>%
mutate(cluster = k2$cluster,
# plots to compare
library(gridExtra)
1. Elbow method
2. Silhouette method
3. Gap statistic
Elbow Method
Recall that the basic idea behind cluster partitioning
methods, such as k-means clustering, is to define clusters
such that the total intra-cluster variation (known as total
within-cluster variation or total within-cluster sum of
squares) is minimized:
$$minimize\left(\sum_{k=1}^{k} W(C_k)\right) \tag{8}$$
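One way to apply this idea is to fit k-means for a range of k and record the total within-cluster sum of squares each time. The sketch below uses base R only. The range `1:10`, the value `nstart = 10`, and the helper name `wss` are assumptions made for illustration.

```r
set.seed(123)
df <- scale(na.omit(USArrests))

# Total within-cluster sum of squares for a k-cluster solution
wss <- function(k) kmeans(df, centers = k, nstart = 10)$tot.withinss

k.values   <- 1:10
wss_values <- sapply(k.values, wss)

data.frame(k = k.values, wss = round(wss_values, 2))
```

Plotting `wss_values` against `k.values` (for example with `plot(k.values, wss_values, type = "b")`) gives the curve whose bend is read as the elbow.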
set.seed(123)
plot(k.values, wss_values,
set.seed(123)
mean(ss[, 3])
plot(k.values, avg_sil_values,
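The two fragments above appear to come from an average-silhouette computation. A self-contained sketch of that idea follows. It assumes `silhouette()` from the `cluster` package; the range `2:10`, `nstart = 25`, and the helper name `avg_sil` are illustrative choices, not recovered from the original code.

```r
library(cluster)  # assumed source of silhouette()

set.seed(123)
df <- scale(na.omit(USArrests))

# Average silhouette width for a k-cluster k-means solution
avg_sil <- function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  ss <- silhouette(km$cluster, dist(df))
  mean(ss[, 3])   # column 3 holds each observation's silhouette width
}

k.values       <- 2:10   # the silhouette is undefined for a single cluster
avg_sil_values <- sapply(k.values, avg_sil)

# k with the highest average silhouette width
k.values[which.max(avg_sil_values)]
```

A higher average width indicates that observations sit closer to their own cluster than to the neighboring one, so the k that maximizes it is a natural candidate.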
For the observed data and the reference data, the total
intracluster variation is computed using different values
of k. The gap statistic for a given k is defined as follows:
$$Gap_n(k) = E^*_n\{\log(W_k)\} - \log(W_k) \tag{9}$$
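corresponding $W_k$.
$$\bar{w} = \frac{1}{B}\sum_{b}\log(W^*_{kb}), \qquad sd(k) = \sqrt{\frac{1}{B}\sum_{b}\left(\log(W^*_{kb}) - \bar{w}\right)^2}$$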
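A rough base-R sketch of the gap statistic is given below, under stated assumptions: the reference data are drawn uniformly over the observed range of each variable, $W_k$ is taken to be the `tot.withinss` value from `kmeans()`, and the helper names (`log_wk`, `reference_sample`, `gap_for_k`) and the settings `B = 20`, `nstart = 10`, and `k = 1..6` are all hypothetical. The factor `sqrt(1 + 1 / B)` is the usual adjustment for simulation error in the standard formulation of the gap statistic. Packaged implementations may define $W_k$ differently, so the numbers need not match theirs.

```r
set.seed(123)
df <- scale(na.omit(USArrests))

# log of the total within-cluster sum of squares for a k-cluster solution
log_wk <- function(x, k) log(kmeans(x, centers = k, nstart = 10)$tot.withinss)

# One reference data set: each column drawn uniformly over its observed range
reference_sample <- function(x) {
  apply(x, 2, function(col) runif(length(col), min(col), max(col)))
}

# Gap statistic and its simulation standard error for a single k
gap_for_k <- function(x, k, B = 20) {
  ref   <- replicate(B, log_wk(reference_sample(x), k))  # log(W*_kb), b = 1..B
  w_bar <- mean(ref)
  c(gap = w_bar - log_wk(x, k),
    se  = sqrt(mean((ref - w_bar)^2)) * sqrt(1 + 1 / B))
}

gap_table <- t(sapply(1:6, function(k) gap_for_k(df, k)))
gap_table
```

A larger gap means the observed clustering is tighter than what the reference distribution would produce for the same k.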
set.seed(123)
fviz_gap_stat(gap_stat)
Extracting Results
With most of these approaches suggesting 4 as the
optimal number of clusters, we can perform the final
analysis using 4 clusters.
set.seed(123)
print(final)
## Cluster means:
##
## Clustering vector:
##
## Available components:
##
We can also extract the cluster assignments and add them to
our initial data to compute descriptive statistics at the
cluster level:
USArrests %>%
  mutate(Cluster = final$cluster) %>%  # attach each state's cluster label
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 4 × 5
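For readers without the tidyverse loaded, the same per-cluster summary can be sketched in base R. The block is self-contained, so it refits the 4-cluster solution first; `nstart = 25` is an assumed setting and `cluster_means` is an illustrative name.

```r
set.seed(123)
df    <- scale(na.omit(USArrests))
final <- kmeans(df, centers = 4, nstart = 25)

# Mean of each original (unscaled) variable within each cluster
cluster_means <- aggregate(USArrests,
                           by  = list(Cluster = final$cluster),
                           FUN = mean)
cluster_means
```

Summarising the unscaled `USArrests` values rather than the standardized ones keeps the cluster profiles in their original units, which is easier to interpret.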
Additional Comments
K-means clustering is a very simple and fast algorithm.
Furthermore, it can efficiently deal with very large data
sets. However, there are some weaknesses of the k-means
approach.