K-Means Cluster Analysis UC Business Analytics R Programming Guide

The document provides an introduction and overview of k-means clustering analysis in R. It discusses preparing data by removing missing values and standardizing variables. It also covers calculating distance measures between observations, which is important for clustering, such as Euclidean and Manhattan distances. The k-means clustering method is then explained as splitting a dataset into k groups to identify subgroups within the data. Determining the optimal number of clusters to use is also addressed.

Uploaded by

SIDDHARTHA SINHA

8/6/2021 K-means Cluster Analysis · UC Business Analytics R Programming Guide

UC Business Analytics R Programming Guide

K-means Cluster Analysis

Clustering is a broad set of
techniques for finding
subgroups of observations
within a data set. When we
cluster observations, we
want observations in the
same group to be similar and observations in different
groups to be dissimilar. Because there isn’t a response
variable, this is an unsupervised method, which implies
that it seeks to find relationships between the n
observations without being trained by a response variable.
Clustering allows us to identify which observations are
alike and potentially categorize them accordingly. K-means
clustering is the simplest and most commonly used
clustering method for splitting a dataset into a set of k
groups.

tl;dr
This tutorial serves as an introduction to the k-means
clustering method.

1. Replication Requirements: What you’ll need to
reproduce the analysis in this tutorial
2. Data Preparation: Preparing our data for cluster
analysis
3. Clustering Distance Measures: Understanding how
to measure differences in observations
4. K-Means Clustering: Calculations and methods for
creating K subgroups of the data
5. Determining Optimal Clusters: Identifying the right
number of clusters to group your data

Replication Requirements
To replicate this tutorial’s analysis you will need to load
the following packages:

https://uc-r.github.io/kmeans_clustering 1/19

library(tidyverse)  # data manipulation
library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization

Data Preparation
To perform a cluster analysis in R, generally, the data
should be prepared as follows:

1. Rows are observations (individuals) and columns are
variables.
2. Any missing value in the data must be removed or
estimated.
3. The data must be standardized (i.e., scaled) to make
variables comparable. Recall that standardization
consists of transforming the variables such that they
have mean zero and standard deviation one.1

Here, we’ll use the built-in R data set USArrests , which
contains statistics in arrests per 100,000 residents for
assault, murder, and rape in each of the 50 US states in
1973. It also includes the percent of the population living
in urban areas.

df <- USArrests

To remove any missing value that might be present in the
data, type this:

df <- na.omit(df)

As we don’t want the clustering algorithm to depend on an
arbitrary variable unit, we start by scaling/standardizing
the data using the R function scale :

df <- scale(df)

head(df)

## Murder Assault UrbanPop

## Alabama 1.24256408 0.7828393 -0.5209066 -0.003
## Alaska 0.50786248 1.1068225 -1.2117642 2.4842
## Arizona 0.07163341 1.4788032 0.9989801 1.042
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184
## California 0.27826823 1.2628144 1.7589234 2.067
## Colorado 0.02571456 0.3988593 0.8608085 1.864
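As a quick check on the standardization step, the sketch below confirms that each scaled column has mean zero and standard deviation one, and that scale amounts to centering and dividing by the column standard deviation. It uses only base R; the object names (scaled, manual, col_means, col_sds) are introduced here for illustration.

```r
# Standardize USArrests (object names here are illustrative)
scaled <- scale(na.omit(USArrests))

# Each column should now have mean ~0 and standard deviation 1
col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# The same transformation by hand: subtract column means, divide by column sds
centered <- sweep(as.matrix(USArrests), 2, colMeans(USArrests))
manual   <- sweep(centered, 2, apply(USArrests, 2, sd), "/")
print(all.equal(unname(manual), unname(scaled), check.attributes = FALSE))
```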


Clustering Distance Measures

The classification of observations into groups requires
some methods for computing the distance or the
(dis)similarity between each pair of observations. The
result of this computation is known as a dissimilarity or
distance matrix. There are many methods to calculate this
distance information; the choice of distance measures is a
critical step in clustering. It defines how the similarity of
two elements (x, y) is calculated and it will influence the
shape of the clusters.

The classical methods for distance measures are
Euclidean and Manhattan distances, which are defined as
follows:

Euclidean distance:

$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

Manhattan distance:

$$d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{2}$$

where x and y are two vectors of length n.
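To make the two definitions concrete, here is a small sketch that evaluates both distances by hand and with base R's dist function. The vectors x and y are made-up values used only for illustration.

```r
# Two small example vectors (illustrative values)
x <- c(1, 2, 3)
y <- c(4, 0, 3)

# Direct translations of the Euclidean and Manhattan definitions
d_euc <- sqrt(sum((x - y)^2))
d_man <- sum(abs(x - y))

# The same quantities from base R's dist(), which works on rows of a matrix
d_euc_dist <- as.numeric(dist(rbind(x, y), method = "euclidean"))
d_man_dist <- as.numeric(dist(rbind(x, y), method = "manhattan"))

print(c(euclidean = d_euc, manhattan = d_man))
```

Both routes should agree, since dist applies the same formulas to every pair of rows.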

Other dissimilarity measures exist, such as correlation-based
distances, which are widely used for gene expression
data analyses. Correlation-based distance is defined by
subtracting the correlation coefficient from 1. Different
types of correlation methods can be used such as:

Pearson correlation distance:

$$d_{cor}(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3}$$

Spearman correlation distance:

The Spearman correlation method computes the
correlation between the rank of x and the rank of y
variables.

$$d_{spear}(x, y) = 1 - \frac{\sum_{i=1}^{n} (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sqrt{\sum_{i=1}^{n} (x'_i - \bar{x}')^2 \sum_{i=1}^{n} (y'_i - \bar{y}')^2}} \tag{4}$$

where $x'_i = rank(x_i)$ and $y'_i = rank(y_i)$.

Kendall correlation distance:

The Kendall correlation method measures the correspondence
between the ranking of x and y variables. The total
number of possible pairings of x with y observations is n(n
− 1)/2, where n is the size of x and y. Begin by ordering the
pairs by the x values. If x and y are correlated, then they
would have the same relative rank orders. Now,
for each $y_i$, count the number of $y_j > y_i$ (concordant pairs (c)) and
the number of $y_j < y_i$ (discordant pairs (d)).

Kendall correlation distance is defined as follows:

$$d_{kend}(x, y) = 1 - \frac{n_c - n_d}{\frac{1}{2} n(n - 1)} \tag{5}$$

where $n_c$ is the number of concordant pairs, $n_d$ is the number of
discordant pairs, and n is the size of x and y.
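The three correlation-based distances can be sketched with base R's cor function, as below. The vectors are made-up and contain no ties, which is an assumption that keeps the hand-counted Kendall version equal to what cor returns; the variable names are illustrative.

```r
# Illustrative vectors without ties
x <- c(1, 3, 2, 5, 4)
y <- c(2, 3, 1, 5, 6)
n <- length(x)

# Correlation-based distances: 1 minus the correlation coefficient
d_pearson  <- 1 - cor(x, y, method = "pearson")
d_spearman <- 1 - cor(x, y, method = "spearman")
d_kendall  <- 1 - cor(x, y, method = "kendall")

# Spearman is the Pearson formula applied to ranks
d_spearman_manual <- 1 - cor(rank(x), rank(y))

# Kendall by counting concordant and discordant pairs over all n(n - 1)/2 pairs
pairs <- combn(n, 2)
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) * sign(y[pairs[1, ]] - y[pairs[2, ]])
n_c <- sum(s > 0)  # concordant pairs
n_d <- sum(s < 0)  # discordant pairs
d_kendall_manual <- 1 - (n_c - n_d) / (0.5 * n * (n - 1))

print(c(pearson = d_pearson, spearman = d_spearman, kendall = d_kendall))
```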

The choice of distance measures is very important, as it
has a strong influence on the clustering results. For most
common clustering software, the default distance measure
is the Euclidean distance. However, depending on the type
of the data and the research questions, other dissimilarity
measures might be preferred and you should be aware of
the options.

Within R it is simple to compute and visualize the
distance matrix using the functions get_dist and
fviz_dist from the factoextra R package. This starts
to illustrate which states have large dissimilarities (red)
versus those that appear to be fairly similar (teal).

get_dist : for computing a distance matrix
between the rows of a data matrix. The default
distance computed is the Euclidean; however,
get_dist also supports the distances described in
equations 2-5 above plus others.
fviz_dist : for visualizing a distance matrix


distance <- get_dist(df)

fviz_dist(distance, gradient = list(low = "#00AFBB",

K-Means Clustering
K-means clustering is the most commonly used
unsupervised machine learning algorithm for partitioning
a given data set into a set of k groups (i.e. k clusters),
where k represents the number of groups pre-specified by
the analyst. It classifies objects in multiple groups (i.e.,
clusters), such that objects within the same cluster are as
similar as possible (i.e., high intra-class similarity),
whereas objects from different clusters are as dissimilar as
possible (i.e., low inter-class similarity). In k-means
clustering, each cluster is represented by its center (i.e.,
centroid), which corresponds to the mean of the points
assigned to the cluster.

The Basic Idea

The basic idea behind k-means clustering consists of
defining clusters so that the total intra-cluster variation
(known as total within-cluster variation) is minimized.
There are several k-means algorithms available. The
standard algorithm is the Hartigan-Wong algorithm
(1979), which defines the total within-cluster variation as
the sum of squared Euclidean distances between
items and the corresponding centroid:

$$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2 \tag{6}$$

where:

$x_i$ is a data point belonging to the cluster $C_k$
$\mu_k$ is the mean value of the points assigned to the cluster $C_k$

Each observation ($x_i$) is assigned to a given cluster such
that the sum of squares (SS) distance of the observation to
its assigned cluster center ($\mu_k$) is minimized.

We define the total within-cluster variation as follows:

$$tot.withinss = \sum_{k=1}^{k} W(C_k) = \sum_{k=1}^{k} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \tag{7}$$

The total within-cluster sum of squares measures the
compactness (i.e., goodness) of the clustering and we want
it to be as small as possible.
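The within-cluster variation can be recomputed by hand from a fitted model, which is a useful way to see what the two sums mean. The sketch below assumes the scaled USArrests data and k = 2; the names x, km, w_ck and tot are illustrative.

```r
# Fit k-means on the scaled data (k = 2 chosen only for illustration)
x <- scale(USArrests)
set.seed(123)
km <- kmeans(x, centers = 2, nstart = 25)

# W(C_k): sum of squared Euclidean distances from each member to its cluster mean
w_ck <- sapply(1:2, function(k) {
  members <- x[km$cluster == k, , drop = FALSE]
  mu_k <- colMeans(members)
  sum(sweep(members, 2, mu_k)^2)
})

# Total within-cluster variation: the sum over clusters
tot <- sum(w_ck)

print(rbind(manual = w_ck, kmeans = km$withinss))
print(c(manual = tot, kmeans = km$tot.withinss))
```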

K-means Algorithm
The first step when using k-means clustering is to indicate
the number of clusters (k) that will be generated in the
final solution. The algorithm starts by randomly selecting
k objects from the data set to serve as the initial centers
for the clusters. The selected objects are also known as
cluster means or centroids. Next, each of the remaining
objects is assigned to its closest centroid, where closest is
defined using the Euclidean distance (Eq. 1) between the
object and the cluster mean. This step is called the “cluster
assignment step”. After the assignment step, the
algorithm computes the new mean value of each cluster.
The term “centroid update” is used to designate this
step. Now that the centers have been recalculated, every
observation is checked again to see if it might be closer to
a different cluster. All the objects are reassigned
using the updated cluster means. The cluster assignment
and centroid update steps are repeated until
the cluster assignments stop changing (i.e., until
convergence is achieved). That is, the clusters formed in
the current iteration are the same as those obtained in the
previous iteration.

K-means algorithm can be summarized as follows:


1. Specify the number of clusters (K) to be created (by
the analyst)
2. Randomly select k objects from the data set as the
initial cluster centers or means
3. Assign each observation to its closest centroid,
based on the Euclidean distance between the object
and the centroid
4. For each of the k clusters, update the cluster centroid
by calculating the new mean values of all the data
points in the cluster. The centroid of the kth cluster is
a vector of length p containing the means of all
variables for the observations in the kth cluster; p is
the number of variables.
5. Iteratively minimize the total within sum of squares
(Eq. 7). That is, iterate steps 3 and 4 until the cluster
assignments stop changing or the maximum number
of iterations is reached. By default, R uses 10 as the
maximum number of iterations.
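The five steps above can be sketched in a few lines of base R. This is a minimal assign-and-update loop written for illustration only; it is not the Hartigan-Wong algorithm that kmeans uses by default, and the function name simple_kmeans and its arguments are hypothetical.

```r
# Minimal assign/update k-means loop (simple_kmeans is a hypothetical helper)
simple_kmeans <- function(x, k, iter.max = 10) {
  x <- as.matrix(x)
  # Step 2: pick k distinct observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(iter.max)) {
    # Step 3: squared Euclidean distance from every row to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of nearest center
    # Step 5: stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: move each centroid to the mean of its current members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(123)
x <- scale(USArrests)
res <- simple_kmeans(x, k = 2)
print(table(res$cluster))
```

Because the starting centers are random, a single run may settle on a poorer solution than another run would, which is one reason multiple random starts are commonly used.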

Computing k-means clustering in R

We can compute k-means in R with the kmeans function.
Here we will group the data into two clusters ( centers =
2 ). The kmeans function also has an nstart option
that attempts multiple initial configurations and reports
on the best one. For example, adding nstart = 25 will
generate 25 initial configurations. This approach is often
recommended.

k2 <- kmeans(df, centers = 2, nstart = 25)

str(k2)

## List of 9

## $ cluster : Named int [1:50] 1 1 1 2 1 1 2 2

## ..- attr(*, "names")= chr [1:50] "Alabama" "Ala
## $ centers : num [1:2, 1:4] 1.005 -0.67 1.014
## ..- attr(*, "dimnames")=List of 2

## .. ..$ : chr [1:2] "1" "2"

## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop

## $ totss : num 196

## $ withinss : num [1:2] 46.7 56.1

## $ tot.withinss: num 103

## $ betweenss : num 93.1

## $ size : int [1:2] 20 30

## $ iter : int 1

## $ ifault : int 0

## - attr(*, "class")= chr "kmeans"


The output of kmeans is a list with several bits of
information. The most important are:

cluster : A vector of integers (from 1:k) indicating
the cluster to which each point is allocated.
centers : A matrix of cluster centers.
totss : The total sum of squares.
withinss : Vector of within-cluster sum of squares,
one component per cluster.
tot.withinss : Total within-cluster sum of
squares, i.e. sum(withinss).
betweenss : The between-cluster sum of squares,
i.e. $totss-tot.withinss$.
size : The number of points in each cluster.
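The relationships between these components can be verified directly on a fitted object. The sketch below refits the two-cluster model on the scaled USArrests data; the names x, total_ss and checks are illustrative.

```r
# Refit the two-cluster model on the scaled data
x <- scale(USArrests)
set.seed(123)
k2 <- kmeans(x, centers = 2, nstart = 25)

# Total sum of squares around the overall column means
total_ss <- sum(sweep(x, 2, colMeans(x))^2)

# Each entry should be TRUE if the component means what the list above says
checks <- c(
  totss        = isTRUE(all.equal(k2$totss, total_ss)),
  tot.withinss = isTRUE(all.equal(k2$tot.withinss, sum(k2$withinss))),
  betweenss    = isTRUE(all.equal(k2$betweenss, k2$totss - k2$tot.withinss)),
  size         = all(k2$size == as.vector(table(k2$cluster)))
)
print(checks)
```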

If we print the results we’ll see that our groupings resulted
in 2 clusters of sizes 20 and 30. We see the cluster centers
(means) for the two groups across the four variables
(Murder, Assault, UrbanPop, Rape). We also get the cluster
assignment for each observation (e.g., Alabama was
assigned to cluster 1, Arkansas was assigned to cluster 2,
etc.).

## K-means clustering with 2 clusters of sizes 20, 3

## Cluster means:

## Murder Assault UrbanPop Rape

## 1 1.004934 1.0138274 0.1975853 0.8469650

## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433

## Clustering vector:

## Alabama Alaska Arizona

## 1 1 1
## Colorado Connecticut Delaware
## 1 2 2
## Hawaii Idaho Illinois
## 2 2 1
## Kansas Kentucky Louisiana
## 2 2 1
## Massachusetts Michigan Minnesota M
## 2 1 2
## Montana Nebraska Nevada New
## 2 2 1
## New Mexico New York North Carolina Nor
## 1 1 1
## Oklahoma Oregon Pennsylvania Rho
## 2 2 2
## South Dakota Tennessee Texas
## 2 1 1
## Virginia Washington West Virginia
## 2 2 2

## Within cluster sum of squares by cluster:

## [1] 46.74796 56.11445

## (between_SS / total_SS = 47.5 %)

## Available components:

## [1] "cluster" "centers" "totss"

## [5] "tot.withinss" "betweenss" "size"
## [9] "ifault"

We can also view our results by using fviz_cluster .
This provides a nice illustration of the clusters. If there
are more than two dimensions (variables) fviz_cluster
will perform principal component analysis (PCA) and plot
the data points according to the first two principal
components that explain the majority of the variance.

fviz_cluster(k2, data = df)
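The projection onto the first two principal components can be approximated in base R with prcomp. This is only a rough sketch of the idea described above, not factoextra's implementation; the names x, pca and var_explained are illustrative.

```r
# Refit the two-cluster model on the scaled data
x <- scale(USArrests)
set.seed(123)
k2 <- kmeans(x, centers = 2, nstart = 25)

# Principal component analysis of the scaled variables
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot of the first two components, colored by cluster
plot(pca$x[, 1], pca$x[, 2], col = k2$cluster, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(x), pos = 3, cex = 0.6)
```

The axis labels report the share of variance each component explains, which indicates how much of the four-variable structure the two-dimensional picture retains.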

Alternatively, you can use standard pairwise scatter plots
to illustrate the clusters compared to the original
variables.

df %>%

as_tibble() %>%

mutate(cluster = k2$cluster,

state = row.names(USArrests)) %>%

ggplot(aes(UrbanPop, Murder, color = factor(cluster

geom_text()


Because the number of clusters (k) must be set before we
start the algorithm, it is often advantageous to use several
different values of k and examine the differences in the
results. We can execute the same process for 3, 4, and 5
clusters, and the results are shown in the figure:

k3 <- kmeans(df, centers = 3, nstart = 25)

k4 <- kmeans(df, centers = 4, nstart = 25)

k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare

p1 <- fviz_cluster(k2, geom = "point", data = df) + g

p2 <- fviz_cluster(k3, geom = "point", data = df) +
p3 <- fviz_cluster(k4, geom = "point", data = df) +
p4 <- fviz_cluster(k5, geom = "point", data = df) +

library(gridExtra)

grid.arrange(p1, p2, p3, p4, nrow = 2)


Although this visual assessment tells us where true
delineations occur (or do not occur, such as clusters 2 & 4
in the k = 5 graph) between clusters, it does not tell us
what the optimal number of clusters is.

Determining Optimal Clusters

As you may recall, the analyst specifies the number of
clusters to use; preferably the analyst would like to use
the optimal number of clusters. To aid the analyst, the
following explains the three most popular methods for
determining the optimal clusters, which include:

1. Elbow method
2. Silhouette method
3. Gap statistic

Elbow Method
Recall that the basic idea behind cluster partitioning
methods, such as k-means clustering, is to define clusters
such that the total intra-cluster variation (known as total
within-cluster variation or total within-cluster sum of
squares) is minimized:

$$minimize\left(\sum_{k=1}^{k} W(C_k)\right) \tag{8}$$

where $C_k$ is the $k^{th}$ cluster and $W(C_k)$ is the within-cluster
variation. The total within-cluster sum of squares
(wss) measures the compactness of the clustering and we
want it to be as small as possible. Thus, we can use the
following algorithm to define the optimal clusters:

1. Compute clustering algorithm (e.g., k-means
clustering) for different values of k. For instance, by
varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of
squares (wss).
3. Plot the curve of wss according to the number of
clusters k.
4. The location of a bend (knee) in the plot is generally
considered as an indicator of the appropriate
number of clusters.


We can implement this in R with the following code. The
results suggest that 4 is the optimal number of clusters as
it appears to be the bend (the knee, or elbow) in the curve.

set.seed(123)

# function to compute total within-cluster sum of squares
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)

plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

Fortunately, this process to compute the “Elbow method”
has been wrapped up in a single function
( fviz_nbclust ):

set.seed(123)

fviz_nbclust(df, kmeans, method = "wss")


Average Silhouette Method

In short, the average silhouette approach measures the
quality of a clustering. That is, it determines how well
each object lies within its cluster. A high average
silhouette width indicates a good clustering. The average
silhouette method computes the average silhouette of
observations for different values of k. The optimal number
of clusters k is the one that maximizes the average
silhouette over a range of possible values for k.2
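To show what a silhouette width measures, the sketch below computes it by hand for a two-cluster solution: for each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b). It assumes the scaled USArrests data and no single-member clusters; the names x, km, d, cl and sil_width are illustrative.

```r
# Two-cluster solution on the scaled data
x <- scale(USArrests)
set.seed(123)
km <- kmeans(x, centers = 2, nstart = 25)
d  <- as.matrix(dist(x))  # pairwise Euclidean distances
cl <- km$cluster

sil_width <- sapply(seq_len(nrow(x)), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                       # exclude the observation itself
  a <- mean(d[i, own])                  # cohesion: mean distance within own cluster
  other <- cl != cl[i]
  b <- min(tapply(d[i, other], cl[other], mean))  # separation: nearest other cluster
  (b - a) / max(a, b)
})

# Average silhouette width for this clustering
print(mean(sil_width))
```

Values near 1 indicate observations that sit well inside their cluster, while values near 0 or below indicate observations close to, or on the wrong side of, a cluster boundary.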

We can use the silhouette function in the cluster
package to compute the average silhouette width. The
following code computes this approach for 2-15 clusters.
The results show that 2 clusters maximize the average
silhouette values, with 4 clusters coming in as the second
optimal number of clusters.

# function to compute average silhouette for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(df, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(df))
  mean(ss[, 3])
}

# Compute and plot average silhouette for k = 2 to k = 15
k.values <- 2:15

# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)

plot(k.values, avg_sil_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Average Silhouettes")


Similar to the elbow method, this process to compute the
“average silhouette method” has been wrapped up in a
single function ( fviz_nbclust ):

fviz_nbclust(df, kmeans, method = "silhouette")

Gap Statistic Method

The gap statistic was published by R. Tibshirani, G.
Walther, and T. Hastie (Stanford University, 2001). The
approach can be applied to any clustering method (e.g., K-means
clustering, hierarchical clustering). The gap
statistic compares the total intracluster variation for
different values of k with their expected values under a null
reference distribution of the data (i.e., a distribution with
no obvious clustering). The reference dataset is generated
using Monte Carlo simulations of the sampling process.
That is, for each variable ($x_i$) in the data set we compute
its range $[\min(x_i), \max(x_i)]$ and generate values for the
n points uniformly from the interval min to max.

For the observed data and the reference data, the total
intracluster variation is computed using different values
of k. The gap statistic for a given k is defined as follows:

$$Gap_n(k) = E_n^* \log(W_k^*) - \log(W_k) \tag{9}$$

where $E_n^*$ denotes the expectation under a sample size n
from the reference distribution. $E_n^*$ is defined via
bootstrapping (B) by generating B copies of the reference
datasets and by computing the average $\log(W_k^*)$. The gap
statistic measures the deviation of the observed $W_k$ value
from its expected value under the null hypothesis. The
estimate of the optimal clusters ($\hat{k}$) will be the value that
maximizes $Gap_n(k)$. This means that the clustering
structure is far away from the uniform distribution of
points.

In short, the algorithm involves the following steps:

1. Cluster the observed data, varying the number of
clusters from $k = 1, \dots, k_{max}$, and compute the
corresponding $W_k$.
2. Generate B reference data sets and cluster each of
them with varying number of clusters
$k = 1, \dots, k_{max}$. Compute the estimated gap
statistics presented in eq. 9.
3. Let $\bar{w} = (1/B) \sum_b \log(W^*_{kb})$, compute the standard
deviation $sd_k = \sqrt{(1/B) \sum_b (\log(W^*_{kb}) - \bar{w})^2}$
and define $s_k = sd_k \times \sqrt{1 + 1/B}$.
4. Choose the number of clusters as the smallest k such
that $Gap(k) \geq Gap(k + 1) - s_{k+1}$.
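The four steps above can be sketched in base R as follows. This is an illustration under stated assumptions rather than a reference implementation: $W_k$ is taken to be the tot.withinss reported by kmeans, reference data are drawn uniformly over each variable's range, and K.max and B are kept small so the example runs quickly. Its numbers need not match those of a packaged routine. The helper names log_wk and ref_data are hypothetical.

```r
x <- scale(USArrests)
set.seed(123)

# log(W_k), with W_k assumed to be the total within-cluster sum of squares
log_wk <- function(data, k) {
  log(kmeans(data, centers = k, nstart = 10)$tot.withinss)
}

# One reference data set: each variable drawn uniformly over its observed range
ref_data <- function(data) {
  apply(data, 2, function(col) runif(length(col), min(col), max(col)))
}

K.max <- 6   # largest k considered (illustrative)
B     <- 20  # number of reference data sets (illustrative)

# Step 1: observed log(W_k) for k = 1..K.max
obs <- sapply(1:K.max, function(k) log_wk(x, k))

# Step 2: log(W*_kb) for each reference set; result is a K.max x B matrix
ref <- replicate(B, {
  r <- ref_data(x)
  sapply(1:K.max, function(k) log_wk(r, k))
})
gap <- rowMeans(ref) - obs

# Step 3: standard deviation across reference sets and the inflated s_k
sd_k <- apply(ref, 1, function(v) sqrt(mean((v - mean(v))^2)))
s_k  <- sd_k * sqrt(1 + 1 / B)

# Step 4: smallest k with Gap(k) >= Gap(k + 1) - s_{k+1}
ok <- gap[-K.max] >= gap[-1] - s_k[-1]
k_hat <- if (any(ok)) which(ok)[1] else K.max

print(round(cbind(k = 1:K.max, gap = gap, s_k = s_k), 4))
print(k_hat)
```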

To compute the gap statistic method we can use the
clusGap function, which returns the gap statistic and
its standard error.

# compute gap statistic

set.seed(123)

gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,

K.max = 10, B = 50)

# Print the result

print(gap_stat, method = "firstmax")


## Clustering Gap statistic ["clusGap"] from call:

## clusGap(x = df, FUNcluster = kmeans, K.max = 10, B

## B=50 simulated reference sets, k = 1..10; spaceH0=
## --> Number of clusters (method 'firstmax'): 4

## logW E.logW gap SE.sim

## [1,] 3.458369 3.638250 0.1798804 0.03653200

## [2,] 3.135112 3.371452 0.2363409 0.03394132

## [3,] 2.977727 3.235385 0.2576588 0.03635372

## [4,] 2.826221 3.120441 0.2942199 0.03615597

## [5,] 2.738868 3.020288 0.2814197 0.03950085

## [6,] 2.669860 2.933533 0.2636730 0.03957994

## [7,] 2.598748 2.855759 0.2570109 0.03809451

## [8,] 2.531626 2.784000 0.2523744 0.03869283

## [9,] 2.468162 2.716498 0.2483355 0.03971815

## [10,] 2.394884 2.652241 0.2573567 0.04104674

We can visualize the results with fviz_gap_stat , which
suggests four clusters as the optimal number of clusters.

fviz_gap_stat(gap_stat)

In addition to these commonly used approaches, the

NbClust package, published by Charrad et al., 2014,
provides 30 indices for determining the relevant number
of clusters and proposes to users the best clustering
scheme from the different results obtained by varying all
combinations of number of clusters, distance measures,
and clustering methods.

Extracting Results
With most of these approaches suggesting 4 as the
optimal number of clusters, we can perform the final
analysis and extract the results using 4 clusters.

# Compute k-means clustering with k = 4

set.seed(123)

final <- kmeans(df, 4, nstart = 25)

print(final)

## K-means clustering with 4 clusters of sizes 13, 1

## Cluster means:

## Murder Assault UrbanPop Rape

## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331

## 2 -0.4894375 -0.3826001 0.5758298 -0.26165379

## 3 0.6950701 1.0394414 0.7226370 1.27693964

## 4 1.4118898 0.8743346 -0.8145211 0.01927104

## Clustering vector:

## Alabama Alaska Arizona

## 4 3 3
## Colorado Connecticut Delaware
## 3 2 2
## Hawaii Idaho Illinois
## 2 1 3
## Kansas Kentucky Louisiana
## 2 1 4
## Massachusetts Michigan Minnesota M
## 2 3 1
## Montana Nebraska Nevada New
## 1 1 3
## New Mexico New York North Carolina Nor
## 3 3 4
## Oklahoma Oregon Pennsylvania Rho
## 2 2 2
## South Dakota Tennessee Texas
## 1 4 3
## Virginia Washington West Virginia
## 2 2 1
##

## Within cluster sum of squares by cluster:

## [1] 11.952463 16.212213 19.922437 8.316061

## (between_SS / total_SS = 71.2 %)

## Available components:

## [1] "cluster" "centers" "totss"

## [5] "tot.withinss" "betweenss" "size"
## [9] "ifault"

We can visualize the results using fviz_cluster :

fviz_cluster(final, data = df)


And we can extract the clusters and add to our initial data
to do some descriptive statistics at the cluster level:

USArrests %>%

mutate(Cluster = final$cluster) %>%

group_by(Cluster) %>%

summarise_all("mean")

## # A tibble: 4 × 5

## Cluster Murder Assault UrbanPop Rape

## <int> <dbl> <dbl> <dbl> <dbl>

## 1 1 3.60000 78.53846 52.07692 12.17692

## 2 2 5.65625 138.87500 73.87500 18.78125

## 3 3 10.81538 257.38462 76.00000 33.19231

## 4 4 13.93750 243.62500 53.75000 21.41250
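The same per-cluster summary can be produced without the tidyverse, using base R's aggregate function. The sketch below refits the four-cluster model so that it is self-contained; the names x and cluster_means are illustrative, and cluster labels may be numbered differently from the output shown above because k-means labels are arbitrary.

```r
# Refit the four-cluster model on the scaled data
x <- scale(USArrests)
set.seed(123)
final <- kmeans(x, 4, nstart = 25)

# Mean of each original (unscaled) variable within each cluster
cluster_means <- aggregate(USArrests,
                           by = list(Cluster = final$cluster),
                           FUN = mean)
print(cluster_means)
```

Summarizing the unscaled data keeps the means in the original units (arrests per 100,000 residents and percent urban population), which is easier to interpret than standardized values.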

Additional Comments
K-means clustering is a very simple and fast algorithm.
Furthermore, it can efficiently deal with very large data
sets. However, there are some weaknesses of the k-means
approach.

One potential disadvantage of K-means clustering is that

it requires us to pre-specify the number of clusters.
Hierarchical clustering is an alternative approach which
does not require that we commit to a particular choice of
clusters. Hierarchical clustering has an added advantage
over K-means clustering in that it results in an attractive
tree-based representation of the observations, called a
dendrogram. A future tutorial will illustrate the
hierarchical clustering approach.

An additional disadvantage of K-means is that it’s
sensitive to outliers and different results can occur if you
change the ordering of your data. The Partitioning Around
Medoids (PAM) clustering approach is less sensitive to
outliers and provides a robust alternative to k-means to
deal with these situations. A future tutorial will illustrate
the PAM clustering approach.

For now, you can learn more about clustering methods
with:

An Introduction to Statistical Learning

Applied Predictive Modeling
Elements of Statistical Learning
A Practical Guide to Cluster Analysis in R

1. Standardization makes the four distance measure methods -
Euclidean, Manhattan, Correlation and Eisen - more similar than they
would be with non-transformed data. ↩

2. Kaufman and Rousseeuw, 1990 ↩

https://uc-r.github.io/kmeans_clustering 19/19

Hourglass Workout Program by Luisagiuliet 2
76% (21)
Hourglass Workout Program by Luisagiuliet 2
51 pages
12 Week Program: Summer Body Starts Now
89% (45)
12 Week Program: Summer Body Starts Now
70 pages
Knee Ability Zero Now Complete As A Picture Book 4 PDF Free
94% (68)
Knee Ability Zero Now Complete As A Picture Book 4 PDF Free
49 pages
Read People Like A Book by Patrick King-Edited
61% (72)
Read People Like A Book by Patrick King-Edited
12 pages
Livingood, Blake - Livingood Daily Your 21-Day Guide To Experience Real Health
77% (13)
Livingood, Blake - Livingood Daily Your 21-Day Guide To Experience Real Health
260 pages
Cheat Code To The Universe
94% (77)
Cheat Code To The Universe
34 pages
Facial Gains Guide (001 081)
91% (45)
Facial Gains Guide (001 081)
81 pages
Curse of Strahd
95% (467)
Curse of Strahd
258 pages
The Psychiatric Interview - Daniel Carlat
91% (34)
The Psychiatric Interview - Daniel Carlat
473 pages
The Borax Conspiracy
91% (57)
The Borax Conspiracy
14 pages
The Secret Language of Attraction
86% (107)
The Secret Language of Attraction
278 pages
How To Develop and Write A Grant Proposal
83% (541)
How To Develop and Write A Grant Proposal
17 pages
Workbook For The Body Keeps The Score
88% (52)
Workbook For The Body Keeps The Score
111 pages
KamaSutra Positions
78% (69)
KamaSutra Positions
55 pages
7 Hermetic Principles
93% (29)
7 Hermetic Principles
3 pages
27 Feedback Mechanisms Pogil Key
77% (13)
27 Feedback Mechanisms Pogil Key
6 pages
Frank Hammond - List of Demons
92% (92)
Frank Hammond - List of Demons
3 pages
Phone Codes
78% (27)
Phone Codes
5 pages
36 Questions That Lead To Love
91% (35)
36 Questions That Lead To Love
3 pages
The 36 Questions That Lead To Love - The New York Times
94% (34)
The 36 Questions That Lead To Love - The New York Times
3 pages
100 Questions To Ask Your Partner
80% (35)
100 Questions To Ask Your Partner
2 pages
Satanic Calendar
25% (55)
Satanic Calendar
4 pages
The 36 Questions That Lead To Love - The New York Times
95% (21)
The 36 Questions That Lead To Love - The New York Times
3 pages
Jeffrey Epstein39s Little Black Book Unredacted PDF
75% (12)
Jeffrey Epstein39s Little Black Book Unredacted PDF
95 pages
14 Easiest & Hardest Muscles To Build (Ranked With Solutions)
100% (7)
14 Easiest & Hardest Muscles To Build (Ranked With Solutions)
27 pages
ALCHEMIST
64% (14)
ALCHEMIST
4 pages
1001 Songs
70% (71)
1001 Songs
1,798 pages
Cessna 421C AFM PDF
100% (2)
Cessna 421C AFM PDF
25 pages
The 4 Hour Workweek, Expanded and Updated by Timothy Ferriss - Excerpt
23% (954)
The 4 Hour Workweek, Expanded and Updated by Timothy Ferriss - Excerpt
38 pages
Zodiac Sign & Their Most Common Addictions
63% (30)
Zodiac Sign & Their Most Common Addictions
9 pages
3.1.4 - EM CK 01 D - Fire in Engineroom
No ratings yet
3.1.4 - EM CK 01 D - Fire in Engineroom
1 page
The Nightingale and The Rose Notes
100% (1)
The Nightingale and The Rose Notes
9 pages
L/L Research: The Law of One, Book III, Session 65
No ratings yet
L/L Research: The Law of One, Book III, Session 65
5 pages