Fem2063 Data Analytics - May 2020 Lab Practice 5 (Week 6)
Learning Outcomes
The goal of this lab session is to learn the implementations of the following unsupervised learning
methods using Python:
1. Singular value decomposition (SVD)
2. Principal component analysis (PCA)
******************************************************************************
1. SVD
a. SVD of a matrix
A = [ 2   0   2 ]
    [ 1   2  -1 ]
import numpy as np
from scipy.linalg import svd

# a 2x3 matrix, factorised as A = U * Sigma * V^T
A = np.array([[2, 0, 2],
              [1, 2, -1]])
U, S, Vt = svd(A)

print(A)    # original matrix
print(U)    # 2x2 matrix of left singular vectors
print(S)    # singular values, returned as a 1-D vector
print(Vt)   # V^T: 3x3 matrix of right singular vectors (already transposed)
The svd function takes a matrix and returns the U, Sigma, and V^T factors. The Sigma diagonal matrix is returned as a vector of singular values, and the V matrix is returned in transposed form, i.e. V^T (Vt in the code above).
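As a quick check, A can be rebuilt from these three factors; a short sketch reusing U, S and Vt from the code above:
# build the 2x3 Sigma matrix by placing the singular values on its diagonal
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)

# U @ Sigma @ Vt reproduces A up to floating-point error
print(U @ Sigma @ Vt)
print(np.allclose(A, U @ Sigma @ Vt))   # True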
2. PCA
# imports for the PCA examples
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# generate 200 correlated two-dimensional points
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
The fit learns some quantities from the data, most importantly the "components" and
"explained variance":
To see what these numbers mean, let's visualize them as vectors over the input data, using
the "components" to define the direction of the vector, and the "explained variance" to
define the squared-length of the vector:
from sklearn.decomposition import PCA

def draw_vector(v0, v1, ax=None):
    # helper (not in the handout) to draw an arrow from v0 to v1
    if ax is None:
        ax = plt.gca()
    ax.annotate('', v1, v0,
                arrowprops=dict(arrowstyle='->', linewidth=2, shrinkA=0, shrinkB=0))

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=2, whiten=True)
pca.fit(X)

# plot the data with the principal axes overlaid as arrows (ax[0] of a 1x2 grid)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v, ax=ax[0])
ax[0].axis('equal')
ax[0].set(xlabel='x', ylabel='y', title='input')
These vectors represent the principal axes of the data, and the length of each vector indicates how "important" that axis is in describing the distribution of the data; more precisely, it is a measure of the variance of the data when projected onto that axis. The projection of each data point onto the principal axes gives the "principal components" of the data.
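As a concrete illustration, the fitted PCA object can compute these principal components directly (a short sketch reusing pca and X from above):
# project each point onto the principal axes; row i holds the
# principal components of data point i
X_pca = pca.transform(X)
print(X_pca.shape)       # (200, 2)
# because the PCA above was fitted with whiten=True, each column is
# rescaled to (approximately) unit variance
print(X_pca.std(axis=0))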
We will use the Labeled Faces in the Wild dataset, which consists of several thousand
collated photos of various public figures:
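The loading step is not shown here; one way to obtain the dataset is scikit-learn's fetch_lfw_people (the min_faces_per_person threshold below is our own choice):
from sklearn.datasets import fetch_lfw_people

# download (on first use) and load faces of people with at least 60 images each;
# faces.images holds the pictures, faces.target/target_names the identities
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)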
Let's plot a few of these faces to see what we're working with:
import matplotlib.pyplot as plt

# show the first 15 faces in a 3x5 grid, labelled with each person's name
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
Each image contains 62×47, or nearly 3,000, pixels. We could proceed by simply using each pixel value as a feature, but often it is more effective to use some sort of preprocessor to extract more meaningful features; here we will use principal component analysis to extract 150 fundamental components.
# fit PCA on the flattened face images, keeping the first 150 components
pca = PCA(n_components=150)
pca.fit(faces.data)
In this case, it can be interesting to visualize the images associated with the first several
principal components (these components are technically known as "eigenvectors," so these
types of images are often called "eigenfaces"). As you can see in this figure, they are as
creepy as they sound:
# display the first 24 principal components ("eigenfaces") as images
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')
We can plot the cumulative explained variance of these components to see how much of the data's information the projection preserves:
import numpy as np
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
We see that these 150 components account for just over 90% of the variance. That would
lead us to believe that using only these 150 components, we would recover most of the
essential characteristics of the data. To make this more concrete, we can compare the input
images with the images reconstructed from these 150 components:
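Only the axis-label lines of this comparison figure survive in the handout; a minimal sketch of the missing setup, assuming the reconstruction is obtained with pca.transform followed by pca.inverse_transform and shown in a 2×10 grid of panels (both assumptions), is:
# project the faces onto the 150 components, then map back to pixel space
components = pca.transform(faces.data)
projected = pca.inverse_transform(components)

# top row: original images; bottom row: 150-component reconstructions
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks': [], 'yticks': []},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')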
ax[0, 0].set_ylabel('full-dim\ninput')
ax[1, 0].set_ylabel('150-dim\nreconstruction');
The top row here shows the input images, while the bottom row shows the reconstruction of the images from just 150 of the ~3,000 initial features. Although this reduces the dimensionality of the data by nearly a factor of 20, the projected images contain enough information that we might, by eye, recognize the individuals in each image.
What this means is that our classification algorithm needs to be trained on 150-dimensional data rather than 3,000-dimensional data, which, depending on the particular algorithm we choose, can lead to much more efficient classification.
The PCA performs unsupervised dimensionality reduction, while the logistic regression does the prediction. We use GridSearchCV to choose the dimensionality of the PCA, as sketched below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
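Most of the pipeline code is missing from the handout; a minimal sketch, assuming scikit-learn's digits dataset and a Pipeline of PCA and LogisticRegression tuned with GridSearchCV (the dataset, parameter grid, and figure layout here are all assumptions), continues from the imports above so that the plotting lines that follow have something to refer to:
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# chain PCA (unsupervised dimensionality reduction) with logistic regression
pca = PCA()
logistic = LogisticRegression(max_iter=10000)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

# search over the number of PCA components and the regularisation strength
X_digits, y_digits = datasets.load_digits(return_X_y=True)
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# plot the PCA spectrum; ax0 is the axis used by the lines below
pca.fit(X_digits)
fig, ax0 = plt.subplots(figsize=(6, 4))
ax0.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_xlabel('n_components')
ax0.set_ylabel('PCA explained variance ratio')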
# mark the number of components chosen by the grid search
ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))
plt.tight_layout()
plt.show()
************The End***************