50% found this document useful (2 votes)
1K views

Mini Project Clustering

This document describes a mini-project to perform customer segmentation using k-means clustering. The data contains customer transaction data and marketing offer data. The data is preprocessed by merging the datasets and pivoting the data to create a matrix with customers as rows and offers as columns containing 1s and 0s to indicate responses. K-means clustering is then applied to cluster customers based on their response patterns, with the elbow method used to select the number of clusters K by evaluating the sum of squared errors for different values of K.

Uploaded by

api-362845526
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
50% found this document useful (2 votes)
1K views

Mini Project Clustering

This document describes a mini-project to perform customer segmentation using k-means clustering. The data contains customer transaction data and marketing offer data. The data is preprocessed by merging the datasets and pivoting the data to create a matrix with customers as rows and offers as columns containing 1s and 0s to indicate responses. K-means clustering is then applied to cluster customers based on their response patterns, with the elbow method used to select the number of clusters K by evaluating the sum of squared errors for different values of K.

Uploaded by

api-362845526
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

Mini_Project_Clustering

July 18, 2017

1 Customer Segmentation using Clustering

This mini-project is based on this blog post by yhat. Please feel free to refer to the post for
additional information, and solutions.

In [2]: %matplotlib inline


import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

1.1 Data
The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent
to customers) and transaction level data from customers. The transactional data shows which
offer customers responded to, and what the customer ended up buying. The data is presented as
an Excel workbook containing two worksheets. Each worksheet contains a different dataset.

In [3]: df_offers = pd.read_excel("./WineKMC.xlsx", sheetname=0) # specify which sh


df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discou
df_offers.head()

Out[3]: offer_id campaign varietal min_qty discount origin


0 1 January Malbec 72 56 France
1 2 January Pinot Noir 72 17 France
2 3 February Espumante 144 32 Oregon
3 4 February Champagne 72 48 France
4 5 February Cabernet Sauvignon 144 44 New Zealand

past_peak

1
0 False
1 False
2 True
3 True
4 True

We see that the first dataset contains information about each offer such as the month it is in
effect and several attributes about the wine that the offer refers to: the variety, minimum quantity,
discount, country of origin and whether or not it is past peak. The second dataset in the second
worksheet contains transactional data which offer each customer responded to.

In [4]: df_transactions = pd.read_excel("./WineKMC.xlsx", sheetname=1)


df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()

Out[4]: customer_name offer_id n


0 Smith 2 1
1 Smith 24 1
2 Johnson 17 1
3 Johnson 24 1
4 Johnson 26 1

1.2 Data wrangling


Were trying to learn more about how our customers behave, so we can use their behavior
(whether or not they purchased something based on an offer) as a way to group similar minded
customers together. We can then study those groups to look for patterns and trends which can
help us formulate future offers.
The first thing we need is a way to compare customers. To do this, were going to create a
matrix that contains each customer and a 0/1 indicator for whether or not they responded to a
given offer.
Checkup Exercise Set I
Exercise: Create a data frame where each row has the following columns (Use the pandas
merge and pivot_table functions for this purpose):
customer_name
One column for each offer, with a 1 if the customer responded to the offer
Make sure you also deal with any weird values such as NaN. Read the documentation to de-
velop your solution.

In [20]: # merge
df_merge = df_transactions.merge(df_offers, how='left', on='offer_id') # s
df_merge.head()

Out[20]: customer_name offer_id n campaign varietal min_qty discount \


0 Smith 2 1 January Pinot Noir 72 17
1 Smith 24 1 September Pinot Noir 6 34
2 Johnson 17 1 July Pinot Noir 12 47
3 Johnson 24 1 September Pinot Noir 6 34

2
4 Johnson 26 1 October Pinot Noir 144 83

origin past_peak
0 France False
1 Italy False
2 Germany False
3 Italy False
4 Australia False

In [32]: df_merge[df_merge.customer_name=='Allen']

Out[32]: customer_name offer_id n campaign varietal min_qty discount \


102 Allen 9 1 April Chardonnay 144 57
103 Allen 27 1 October Champagne 72 88

origin past_peak
102 Chile False
103 New Zealand False

In [34]: # unfold
df_pivot = pd.pivot_table(df_merge, index='customer_name', columns='offer_
df_pivot.head()

Out[34]: offer_id 1 2 3 4 5 6 7 8 9 10 ... 23 24 25 26


customer_name ...
Adams 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
Allen 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0
Anderson 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 1
Bailey 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0
Baker 0 0 0 0 0 0 1 0 0 1 ... 0 0 0 0

offer_id 28 29 30 31 32
customer_name
Adams 0 1 1 0 0
Allen 0 0 0 0 0
Anderson 0 0 0 0 0
Bailey 0 0 1 0 0
Baker 0 0 0 1 0

[5 rows x 32 columns]

1.3 K-Means Clustering


Recall that in K-Means Clustering we want to maximize the distance between centroids and mini-
mize the distance between data points and the respective centroid for the cluster they are in. True
evaluation for unsupervised learning would require labeled data; however, we can use a variety
of intuitive metrics to try to pick the number of clusters K. We will introduce two methods: the
Elbow method, the Silhouette method and the gap statistic.

3
1.3.1 Choosing K: The Elbow Sum-of-Squares Method
The first method looks at the sum-of-squares error in each cluster against K. We compute the
distance from each data point to the center of the cluster (centroid) to which the data point was
assigned.
X X X X X
SS = (xi xj )2 = (xi k )2
k xi Ck xj Ck k xi Ck

where xi is a point, Ck represents cluster k and k is the centroid for cluster k. We can plot SS
vs. K and choose the elbow point in the plot as the best value for K. The elbow point is the point
at which the plot starts descending much more slowly.
Checkup Exercise Set II
Exercise:
What values of SS do you believe represent better clusterings? Why?
Create a numpy matrix x_cols with only the columns representing the offers (i.e. the 0/1
colums)
Write code that applies the KMeans clustering method from scikit-learn to this matrix.
Construct a plot showing SS for each K and pick K using this plot. For simplicity, test 2
K 10.
Make a bar chart showing the number of points in each cluster for k-means under the best K.
What challenges did you experience using the Elbow method to pick K?
Smaller values of SS represents better clusterings since smaller SS means each points is closer
to its assigned center. However, this explanation only assumes the number of centers given is
reasonable because in cases like large k or even k = n, the SS is very small but it does not make
any sense to cluster using such a big number of centers.

In [64]: # apply KMeans


x_col=df_pivot
cluster = KMeans(n_clusters=5)
label = cluster.fit_predict(x_col) # no need to train
df_pivot['label'] = label
print('SS is %.3f.' % round(cluster.inertia_,4))
df_pivot.head()

SS is 203.387.

Out[64]: offer_id 1 2 3 4 5 6 7 8 9 10 ... 24 25 26 27 28 2


customer_name ...
Adams 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0
Allen 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 1 0
Anderson 0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0
Bailey 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0
Baker 0 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0

offer_id 30 31 32 label
customer_name
Adams 1 0 0 0

4
Allen 0 0 0 2
Anderson 0 0 0 3
Bailey 1 0 0 0
Baker 0 1 0 4

[5 rows x 33 columns]

In [96]: # find out the optimal k


SS = []
opt_k = -1
for k in range(2, 11):
temp_model = KMeans(n_clusters=k, random_state=10) # use random_state
temp_ss = temp_model.fit(x_col).inertia_
if SS != [] and temp_ss < min(SS):
opt_k = k
SS.append(temp_ss)

print("We obtain the lowest SS %.3f" % min(SS), "at k = %.f" % opt_k)

We obtain the lowest SS 170.226 at k = 10

In [95]: plt.plot(list(range(2,11)), SS)


plt.title('SS of Kmeans Cluster')
plt.xlabel('Number of groups')
plt.ylabel('SS')
plt.show()

5
Since we dont want too many clusters since otherwise it would not make sense to cluster. We
observe that there is a significant drop in SS between 2 and 3 groups. We therefore choose the
optimal k as 3.

In [111]: cluster_3 = KMeans(n_clusters=3)


label_3 = cluster_3.fit_predict(x_col)
plt.bar(pd.Series(label_3).unique()-0.4, pd.Series(label_3).value_counts(
plt.xticks([0, 1, 2])
plt.xlabel('Groups')
plt.ylabel('Counts')
plt.show()

6
1.3.2 Choosing K: The Silhouette Method
There exists another method that measures how well each datapoint xi fits its assigned cluster
and also how poorly it fits into other clusters. This is a different way of looking at the same
objective. Denote axi as the average distance from xi to all other points within its own cluster k.
The lower the value, the better. On the other hand bxi is the minimum average distance from xi to
points in a different cluster, minimized over clusters. That is, compute separately for each cluster
the average distance from xi to the points within that cluster, and then take the minimum. The
silhouette s(xi ) is defined as

bxi axi
s(xi ) =
max (axi , bxi )
The silhouette score is computed on every datapoint in every cluster. The silhouette score ranges
from -1 (a poor clustering) to +1 (a very dense clustering) with 0 denoting the situation where
clusters overlap. Some criteria for the silhouette coefficient is provided in the table below.
Source: http://www.stat.berkeley.edu/~spector/s133/Clus.html
Fortunately, scikit-learn provides a function to compute this for us (phew!) called
sklearn.metrics.silhouette_score. Take a look at this article on picking K in scikit-learn,
as it will help you in the next exercise set.
Checkup Exercise Set III
Exercise: Using the documentation for the silhouette_score function above, construct a
series of silhouette plots like the ones in the article linked above.

7
Exercise: Compute the average silhouette score for each K and plot it. What K does the plot
suggest we should choose? Does it differ from what we found using the Elbow method?

In [137]: from sklearn.datasets import make_blobs


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

X = x_col
silhouette_dict = {}

range_n_clusters = range(2, 11)

for n_clusters in range_n_clusters:


# Create a subplot with 1 row and 2 columns
fig, ax1 = plt.subplots(1)
fig.set_size_inches(5, 5)

# This subplot is the silhouette plot


# The silhouette coefficient can range from -1, 1 but in this example
# lie within [-0.1, 1]
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouet
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

# Initialize the clusterer with n_clusters value and a random generat


# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)

# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the for
# clusters
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
silhouette_dict[n_clusters] = silhouette_avg

# Compute the silhouette scores for each sample


sample_silhouette_values = silhouette_samples(X, cluster_labels)

y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them

8
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]

ith_cluster_silhouette_values.sort()

size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i

color = cm.spectral(float(i) / n_clusters)


ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)

# Label the silhouette plots with their cluster numbers at the mi


ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

# Compute the new y_lower for next plot


y_lower = y_upper + 10 # 10 for the 0 samples

ax1.set_title("The silhouette plot for the various clusters.")


ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

ax1.set_yticks([]) # Clear the yaxis labels / ticks


ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

plt.show()

For n_clusters = 2 The average silhouette_score is : 0.280178305014

9
For n_clusters = 3 The average silhouette_score is : 0.250659377196

10
For n_clusters = 4 The average silhouette_score is : 0.236772886303

11
For n_clusters = 5 The average silhouette_score is : 0.209951014144

12
For n_clusters = 6 The average silhouette_score is : 0.205119021976

13
For n_clusters = 7 The average silhouette_score is : 0.192867897806

14
For n_clusters = 8 The average silhouette_score is : 0.193458942962

15
For n_clusters = 9 The average silhouette_score is : 0.170224641456

16
For n_clusters = 10 The average silhouette_score is : 0.16821550099

17
In [139]: # plot silhouette score
plt.plot(list(range(2,11)), list(silhouette_dict.values()))
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.show()

18
According to the plot of Silhouette score, since we obtained the highest Silhouette score at k
= 2, we therefore conclude that there the optimal k is 2, which is different from the optimal k
concluded from Elbow method.

1.3.3 Choosing K: The Gap Statistic


There is one last method worth covering for picking K, the so-called Gap statistic. The computa-
tion for the gap statistic builds on the sum-of-squares established in the Elbow method discussion,
and compares it to the sum-of-squares of a null distribution, that is, a random set of points with
no clustering. The estimate for the optimal number of clusters K is the value for which log SS falls
the farthest below that of the reference distribution:

Gk = En {log SSk } log SSk


In other words a good clustering yields a much larger difference between the reference distri-
bution and the clustered data. The reference distribution is a Monte Carlo (randomization) pro-
cedure that constructs B random distributions of points within the bounding box (limits) of the
original data and then applies K-means to this synthetic distribution of data points.. En {log SSk }
is just the average SSk over all B replicates. We then compute the standard deviation SS of the
values of SSk computed from the B replicates of the reference distribution and compute
p
sk = 1 + 1/BSS
Finally, we choose K = k such that Gk Gk+1 sk+1 .

19
1.3.4 Aside: Choosing K when we Have Labels
Unsupervised learning expects that we do not have the labels. In some situations, we may wish to
cluster data that is labeled. Computing the optimal number of clusters is much easier if we have
access to labels. There are several methods available. We will not go into the math or details since
it is rare to have access to the labels, but we provide the names and references of these measures.

Adjusted Rand Index


Mutual Information
V-Measure
FowlkesMallows index

See this article for more information about these metrics.

1.4 Visualizing Clusters using PCA


How do we visualize clusters? If we only had two features, we could likely plot the data as
is. But we have 100 data points each containing 32 features (dimensions). Principal Component
Analysis (PCA) will help us reduce the dimensionality of our data from 32 to something lower.
For a visualization on the coordinate plane, we will use 2 dimensions. In this exercise, were going
to use it to transform our multi-dimensional dataset into a 2 dimensional dataset.
This is only one use of PCA for dimension reduction. We can also use PCA when we want to
perform regression but we have a set of highly correlated variables. PCA untangles these corre-
lations into a smaller number of features/predictors all of which are orthogonal (not correlated).
PCA is also used to reduce a large set of variables into a much smaller one.
Checkup Exercise Set IV
Exercise: Use PCA to plot your clusters:
Use scikit-learns PCA function to reduce the dimensionality of your clustering data to 2 com-
ponents
Create a data frame with the following fields:
customer name
cluster id the customer belongs to
the two PCA components (label them x and y)
Plot a scatterplot of the x vs y columns
Color-code points differently based on cluster ID
How do the clusters look?
Based on what you see, what seems to be the best value for K? Moreover, which method of
choosing K seems to have produced the optimal result visually?
Exercise: Now look at both the original raw data about the offers and transactions and look at
the fitted clusters. Tell a story about the clusters in context of the original data. For example, do
the clusters correspond to wine variants or something else interesting?

In [151]: # Initialize and fit a PCA with 2 components


pca = sklearn.decomposition.PCA(n_components=2)
# Transform the values matrix
X = pca.fit_transform(x_col)

# For k=5
cluster = KMeans(n_clusters=5, random_state=10)

20
df_pivot['cluster'] = cluster.fit_predict(x_col)

# Take customer names and clusters from the pivot dataframe


df_pivot_short = pd.DataFrame(df_pivot['cluster']).reset_index()

# Concatenate the data frames


df_pivot_short_2 = pd.DataFrame(X, columns=['x', 'y'])
df_pca = pd.concat([df_pivot_short, df_pivot_short_2], axis=1)

# Create a scatterplot of the reduced data when k=5 as shown through the
plt.rcParams["figure.figsize"] = (5,5)
colors = {0:'blue', 1:'red', 2:'green', 3:'purple', 4:'yellow'}
plt.scatter(x=df_pca['x'], y=df_pca['y'], c=df_pivot['cluster'].apply(lam
plt.xticks(size=10)
plt.xlabel('1st principal component', size=12)
plt.yticks(size=10)
plt.ylabel('2nd principal component', size=12)
plt.title('Cluster Representation by\n 2 Principal Components (k=5)', siz

21
What weve done is weve taken those columns of 0/1 indicator variables, and weve trans-
formed them into a 2-D dataset. We took one column and arbitrarily called it x and then called
the other y. Now we can throw each point into a scatterplot. We color coded each point based on
its cluster so its easier to see them.
Exercise Set V
As we saw earlier, PCA has a lot of other uses. Since we wanted to visualize our data in 2
dimensions, restricted the number of dimensions to 2 in PCA. But what is the true optimal number
of dimensions?
Exercise: Using a new PCA object shown in the next cell, plot the explained_variance_
field and look for the elbow point, the point where the curves rate of descent seems to slow
sharply. This value is one possible value for the optimal number of dimensions. What is it?

In [153]: #your turn


# Initialize a new PCA model with a default number of components.
pca = sklearn.decomposition.PCA()
pca.fit(X)

# Do the rest on your own :)


cluster = KMeans(n_clusters=3, random_state=10)
df_pivot['cluster'] = cluster.fit_predict(x_col)

# Take customer names and clusters from the pivot dataframe


df_pivot_short = pd.DataFrame(df_pivot['cluster']).reset_index()

# Initialize and fit a PCA with 2 components


pca = sklearn.decomposition.PCA(n_components=2)
# Transform the values matrix
X = pca.fit_transform(x_col)

# Concatenate the data frames


df_pivot_short_2 = pd.DataFrame(X, columns=['x', 'y'])
df_pca = pd.concat([df_pivot_short, df_pivot_short_2], axis=1)

# Create a scatterplot of the reduced data when k=5 as shown through the
plt.rcParams["figure.figsize"] = (5,5)
colors = {0:'blue', 1:'red', 2:'green', 3:'purple', 4:'yellow'}
plt.scatter(x=df_pca['x'], y=df_pca['y'], c=df_pivot['cluster'].apply(lam
plt.xticks(size=10)
plt.xlabel('1st principal component', size=12)
plt.yticks(size=10)
plt.ylabel('2nd principal component', size=12)
plt.title('Cluster Representation by\n 2 Principal Components (k=3)', siz

22
In [156]: # Initialize a PCA where components = number of features
# Extract the explained variance in a dataframe
pca = sklearn.decomposition.PCA(n_components=32)
pca.fit(x_col)
variance_pca_df = pd.DataFrame(data={'components':range(1,33),
'explained_variance':pca.explained_v

# Plot the explained variance by the number of components


plt.rcParams["figure.figsize"] = (5,5)
variance_pca_df.plot(x='components', y='explained_variance')
plt.xlabel('Number of components', size=12)
plt.ylabel('Explained Variance', size=12)
plt.xticks(size=10)
plt.yticks(size=10)
plt.title('PCA Elbow Plot');

23
In [160]: variance_pca_df = pd.DataFrame()
for dimension in range(1,33):
pca = sklearn.decomposition.PCA(n_components=dimension)
pca.fit(x_col)
temp_df = pd.DataFrame(data={'components':dimension,
'explained_variance':pca.explained_varia
index=[0])
variance_pca_df = variance_pca_df.append(temp_df)

# Plot the cumulative explained variance by the number of components


plt.rcParams["figure.figsize"] = (5,5)
variance_pca_df.plot(x='components', y='explained_variance')
plt.xlabel('Number of components', size=12)
plt.ylabel('Explained Variance', size=12)
plt.xticks(size=10)
plt.yticks(size=10)
plt.title('Cumulative variance ratio \nby number of components')
plt.legend(loc='lower right');

24
1.5 Other Clustering Algorithms
k-means is only one of a ton of clustering algorithms. Below is a brief description of several
clustering algorithms, and the table provides references to the other clustering algorithms in scikit-
learn.

Affinity Propagation does not require the number of clusters K to be known in advance!
AP uses a message passing paradigm to cluster points based on their similarity.

Spectral Clustering uses the eigenvalues of a similarity matrix to reduce the dimensional-
ity of the data before clustering in a lower dimensional space. This is tangentially similar
to what we did to visualize k-means clusters using PCA. The number of clusters must be
known a priori.

Wards Method applies to hierarchical clustering. Hierarchical clustering algorithms take


a set of data and successively divide the observations into more and more clusters at each
layer of the hierarchy. Wards method is used to determine when two clusters in the hier-
archy should be combined into one. It is basically an extension of hierarchical clustering.

25
Hierarchical clustering is divisive, that is, all observations are part of the same cluster at first,
and at each successive iteration, the clusters are made smaller and smaller. With hierarchi-
cal clustering, a hierarchy is constructed, and there is not really the concept of number of
clusters. The number of clusters simply determines how low or how high in the hierarchy
we reference and can be determined empirically or by looking at the dendogram.

Agglomerative Clustering is similar to hierarchical clustering but but is not divisive, it is


agglomerative. That is, every observation is placed into its own cluster and at each iteration
or level or the hierarchy, observations are merged into fewer and fewer clusters until con-
vergence. Similar to hierarchical clustering, the constructed hierarchy contains all possible
numbers of clusters and it is up to the analyst to pick the number by reviewing statistics or
the dendogram.

DBSCAN is based on point density rather than distance. It groups together points with
many nearby neighbors. DBSCAN is one of the most cited algorithms in the literature. It
does not require knowing the number of clusters a priori, but does require specifying the
neighborhood size.

1.5.1 Clustering Algorithms in Scikit-learn


Method name
Parameters
Scalability
Use Case
Geometry (metric used)
K-Means
number of clusters
Very largen_samples, medium n_clusters with MiniBatch code
General-purpose, even cluster size, flat geometry, not too many clusters
Distances between points
Affinity propagation
damping, sample preference
Not scalable with n_samples
Many clusters, uneven cluster size, non-flat geometry
Graph distance (e.g. nearest-neighbor graph)
Mean-shift
bandwidth
Not scalable with n_samples
Many clusters, uneven cluster size, non-flat geometry
Distances between points
Spectral clustering
number of clusters
Medium n_samples, small n_clusters
Few clusters, even cluster size, non-flat geometry
Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering
number of clusters
Large n_samples and n_clusters
Many clusters, possibly connectivity constraints

26
Distances between points
Agglomerative clustering
number of clusters, linkage type, distance
Large n_samples and n_clusters
Many clusters, possibly connectivity constraints, non Euclidean distances
Any pairwise distance
DBSCAN
neighborhood size
Very large n_samples, medium n_clusters
Non-flat geometry, uneven cluster sizes
Distances between nearest points
Gaussian mixtures
many
Not scalable
Flat geometry, good for density estimation
Mahalanobis distances to centers
Birch
branching factor, threshold, optional global clusterer.
Large n_clusters and n_samples
Large dataset, outlier removal, data reduction.
Euclidean distance between points
Source: http://scikit-learn.org/stable/modules/clustering.html
Exercise Set VI
Exercise: Try clustering using the following algorithms.
Affinity propagation
Spectral clustering
Agglomerative clustering
DBSCAN
How do their results compare? Which performs the best? Tell a story why you think it per-
forms the best.
(Partial code below are from https://github.com/dpalbrecht/Springboard-Exercises)

In [165]: # 1. affinity propagation


from matplotlib.cm import rainbow
from sklearn.cluster import AffinityPropagation

cluster = AffinityPropagation()
df_pivot['cluster'] = cluster.fit_predict(x_col)

# Take customer names and clusters from the pivot dataframe


df_pivot_short = pd.DataFrame(df_pivot['cluster']).reset_index()
print('Number of clusters: {}'.format(len(df_pivot_short['cluster'].uniqu

# Initialize and fit a PCA with 2 components


pca = sklearn.decomposition.PCA(n_components=2)
# Transform the values matrix
X = pca.fit_transform(x_col)

27
# Concatenate the data frames
df_pivot_short_2 = pd.DataFrame(X, columns=['x', 'y'])
df_pca = pd.concat([df_pivot_short, df_pivot_short_2], axis=1)

# Create a scatterplot of the reduced data


colors = rainbow(np.linspace(0, 1, len(df_pivot['cluster'].unique())))
plt.scatter(x=df_pca['x'], y=df_pca['y'], c=colors)
plt.xticks(size=10)
plt.xlabel('1st principal component', size=12)
plt.yticks(size=10)
plt.ylabel('2nd principal component', size=12)
plt.title('Affinity Propogation\nCluster Representation by\n 2 Principal
Number of clusters: 8

In [168]: # 2. Try spectral clustering

28
from sklearn.cluster import SpectralClustering

cluster = SpectralClustering(n_clusters=3)
df_pivot['cluster'] = cluster.fit_predict(x_col)

# Take customer names and clusters from the pivot dataframe


df_pivot_short = pd.DataFrame(df_pivot['cluster']).reset_index()
print('Number of clusters: {}'.format(len(df_pivot_short['cluster'].uniqu

# Initialize and fit a PCA with 2 components


pca = sklearn.decomposition.PCA(n_components=2)
# Transform the values matrix
X = pca.fit_transform(x_col)

# Concatenate the data frames


df_pivot_short_2 = pd.DataFrame(X, columns=['x', 'y'])
df_pca = pd.concat([df_pivot_short, df_pivot_short_2], axis=1)

# Create a scatterplot of the reduced data


colors = rainbow(np.linspace(0, 1, len(df_pivot['cluster'].unique())))
plt.scatter(x=df_pca['x'], y=df_pca['y'], c=colors)
plt.xticks(size=10)
plt.xlabel('1st principal component', size=12)
plt.yticks(size=10)
plt.ylabel('2nd principal component', size=12)
plt.title('Spectral Clustering\nCluster Representation by\n 2 Principal C

Number of clusters: 3

29
In [169]: # 3. Try agglomerative clustering

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=3)
df_pivot['cluster'] = cluster.fit_predict(x_col)

# Take customer names and clusters from the pivot dataframe


df_pivot_short = pd.DataFrame(df_pivot['cluster']).reset_index()
print('Number of clusters: {}'.format(len(df_pivot_short['cluster'].uniqu

# Initialize and fit a PCA with 2 components


pca = sklearn.decomposition.PCA(n_components=2)
# Transform the values matrix
X = pca.fit_transform(x_col)

30
# Concatenate the data frames
df_pivot_short_2 = pd.DataFrame(X, columns=['x', 'y'])
df_pca = pd.concat([df_pivot_short, df_pivot_short_2], axis=1)

# Create a scatterplot of the reduced data


colors = rainbow(np.linspace(0, 1, len(df_pivot['cluster'].unique())))
plt.scatter(x=df_pca['x'], y=df_pca['y'], c=colors)
plt.xticks(size=10)
plt.xlabel('1st principal component', size=12)
plt.yticks(size=10)
plt.ylabel('2nd principal component', size=12)
plt.title('Agglomerative Clustering\nCluster Representation by\n 2 Princi
Number of clusters: 3

In [172]: # 4. Try DBSCAN


from sklearn.cluster import DBSCAN

31
cluster = DBSCAN(min_samples=3)
df_pivot['cluster'] = cluster.fit_predict(x_col)

# Take customer names and clusters from the pivot dataframe


df_pivot_short = pd.DataFrame(df_pivot['cluster']).reset_index()
print('Number of clusters: {}'.format(len(df_pivot_short['cluster'].uniqu
print('Percent of instances classified as noise: {}%'\
.format((len(df_pivot_short[df_pivot_short['cluster'] == -1])/len(d

# Initialize and fit a PCA with 2 components


pca = sklearn.decomposition.PCA(n_components=2)
# Transform the values matrix
X = pca.fit_transform(x_col)

# Concatenate the data frames


df_pivot_short_2 = pd.DataFrame(X, columns=['x', 'y'])
df_pca = pd.concat([df_pivot_short, df_pivot_short_2], axis=1)

# Create a scatterplot of the reduced data


colors = rainbow(np.linspace(0, 1, len(df_pivot['cluster'].unique())))
plt.scatter(x=df_pca['x'], y=df_pca['y'], c=colors)
plt.xticks(size=10)
plt.xlabel('1st principal component', size=12)
plt.yticks(size=10)
plt.ylabel('2nd principal component', size=12)
plt.title('DBSCAN\nCluster Representation by\n 2 Principal Components', s

Number of clusters: 4
Percent of instances classified as noise: 91.0%

32
33

You might also like