
FEM2063 DATA ANALYTICS - MAY 2020

LAB PRACTICE 5 (WEEK 6)

Learning Outcomes
The goal of this lab session is to learn the implementations of the following unsupervised learning
methods using Python:
1. Singular value decomposition (SVD)
2. Principal component analysis (PCA)

******************************************************************************
1. SVD
a. SVD of a matrix
A = [ 2  0   2 ]
    [ 1  2  -1 ]

import numpy as np
from scipy.linalg import svd

A = np.array([[2, 0, 2], [1, 2, -1]])
U, S, Vt = svd(A)   # U (2x2), the singular values S, and V transposed (3x3)
print(A)
print(U)
print(S)
print(Vt)

The svd function takes a matrix and returns the U, Σ and Vᵀ factors. The Σ diagonal
matrix is returned as a vector of singular values, and the V matrix is returned in
transposed form, i.e. Vᵀ.
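
As a quick check (a minimal sketch, not part of the original listing), the matrix can be rebuilt from the three factors; since A is 2×3, the vector of singular values first has to be placed into a 2×3 Σ matrix:

import numpy as np
from scipy.linalg import svd

A = np.array([[2, 0, 2], [1, 2, -1]])
U, S, Vt = svd(A)

# Rebuild the full 2x3 Sigma matrix from the vector of singular values,
# then check that U @ Sigma @ Vt reproduces A (up to floating-point error).
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)
print(np.allclose(A, U @ Sigma @ Vt))   # True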

b. SVD on a Face dataset


We will use the Labeled Faces in the Wild dataset, which consists of several thousand
collated photos of various public figures. A fetcher for the dataset is built into Scikit-Learn:

from sklearn.datasets import fetch_lfw_people


faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

import matplotlib.pyplot as plt


fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])

from numpy import diag
from scipy.linalg import svd

U, s, VT = svd(faces.data, full_matrices=False)

# Keep only the p largest singular values (the 150 most important components)
p = 150
B_sub = U[:, :p].dot(diag(s[:p]).dot(VT[:p, :]))
print(U.shape)
print(s.shape)
print(VT.shape)
print(U[:, :p].shape)
print(diag(s[:p]).shape)
print(VT[:p, :].shape)

fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(B_sub[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])

2. Principal component analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised linear transformation technique.


PCA aims to find the directions of maximum variance in high-dimensional data and
projects it onto a new subspace with equal or fewer dimensions than the original one. The
orthogonal axes (principal components) of the new subspace can be interpreted as the
directions of maximum variance given the constraint that the new feature axes are
orthogonal to each other.
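
To make this concrete, here is a minimal sketch (not part of the original lab; the toy data below is purely illustrative) of the idea behind PCA: the principal directions are the eigenvectors of the data's covariance matrix, ordered by the variance they explain.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 2) @ np.array([[2.0, 0.5], [0.0, 1.0]])  # toy 2-D data

Xc = X - X.mean(axis=0)                  # centre the data
cov = np.cov(Xc, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
print(eigvecs[:, order])                 # principal directions (as columns)
print(eigvals[order])                    # variance explained by each direction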

a. PCA on a randomly generated set of points

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()


rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

The fit learns some quantities from the data, most importantly the "components" and
"explained variance":

from sklearn.decomposition import PCA


pca = PCA(n_components=2)
pca.fit(X)
print('PCA components: \n',pca.components_)
print()
print('PCA explained variance: \n', pca.explained_variance_)

To see what these numbers mean, let's visualize them as vectors over the input data, using
the "components" to define the direction of the vector, and the "explained variance" to
define the squared-length of the vector:

def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=2, whiten=True)
pca.fit(X)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))


fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

# plot data
ax[0].scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v, ax=ax[0])
ax[0].axis('equal');
ax[0].set(xlabel='x', ylabel='y', title='input')


# plot principal components
X_pca = pca.transform(X)
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2)
draw_vector([0, 0], [0, 3], ax=ax[1])
draw_vector([0, 0], [3, 0], ax=ax[1])
ax[1].axis('equal')
ax[1].set(xlabel='component 1', ylabel='component 2',
          title='principal components',
          xlim=(-5, 5), ylim=(-3, 3.1))

These vectors represent the principal axes of the data, and the length of each vector
indicates how "important" that axis is in describing the distribution of the data; more
precisely, it is a measure of the variance of the data when projected onto that axis. The
projections of each data point onto the principal axes are the "principal components" of the
data.
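
As a quick check (a minimal sketch, not part of the original lab), with whiten=False (the default) the transformed coordinates are exactly the centred data projected onto the principal axes:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA(n_components=2).fit(X)
# Project the centred data onto the principal axes by hand ...
manual = (X - pca.mean_) @ pca.components_.T
# ... and confirm it matches pca.transform
print(np.allclose(manual, pca.transform(X)))   # True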

b. PCA on a Face dataset

We will use the Labeled Faces in the Wild dataset, which consists of several thousand
collated photos of various public figures:

from sklearn.datasets import fetch_lfw_people


faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

Let's plot a few of these faces to see what we're working with:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])

Each image contains 62×47, or nearly 3,000, pixels. We could proceed by simply using
each pixel value as a feature, but it is often more effective to use some sort of preprocessor
to extract more meaningful features; here we will use principal component analysis to
extract 150 fundamental components.

from sklearn.decomposition import PCA


import numpy as np


pca = PCA(n_components=150)
pca.fit(faces.data)

In this case, it can be interesting to visualize the images associated with the first several
principal components (these components are technically known as "eigenvectors," so these
types of images are often called "eigenfaces"). As you can see in this figure, they are as
creepy as they sound:
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')

We can plot the cumulative explained variance of these components to see how much of
the information in the data the projection preserves:

import numpy as np
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

We see that these 150 components account for just over 90% of the variance. That would
lead us to believe that using only these 150 components, we would recover most of the
essential characteristics of the data. To make this more concrete, we can compare the input
images with the images reconstructed from these 150 components:

# Compute the components and projected faces


pca = PCA(150).fit(faces.data)
components = pca.transform(faces.data)
projected = pca.inverse_transform(components)
print('Number of samples: ', components.shape[0])
print('Number of components: ', components.shape[1])
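
To put a number on the ">90%" claim above, we can also print the total fraction of variance retained by the 150 components (a one-line check using the fitted pca object; not part of the original listing):

# Fraction of the total variance captured by the 150 retained components
print('Variance retained: %.3f' % pca.explained_variance_ratio_.sum())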

# Plot the results


fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks':[], 'yticks':[]},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')

ax[0, 0].set_ylabel('full-dim\ninput')


ax[1, 0].set_ylabel('150-dim\nreconstruction');

The top row here shows the input images, while the bottom row shows the reconstruction
of the images from just 150 of the ~3,000 initial features. Although the projection reduces
the dimensionality of the data by nearly a factor of 20, the projected images contain enough
information that we might, by eye, recognize the individuals in the images.

What this means is that our classification algorithm can be trained on 150-dimensional
data rather than 3,000-dimensional data, which, depending on the particular algorithm we
choose, can lead to much more efficient classification.
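
As an illustration of that point, here is a minimal sketch (not part of the original lab; the train/test split and the plain LogisticRegression classifier are illustrative choices) of training a classifier directly on the 150-dimensional PCA features of the faces data:

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

faces = fetch_lfw_people(min_faces_per_person=60)

# Split first, then fit PCA on the training images only
Xtrain, Xtest, ytrain, ytest = train_test_split(
    faces.data, faces.target, random_state=42)
pca = PCA(n_components=150).fit(Xtrain)

# The classifier sees 150 features per image instead of ~3,000 pixels
clf = LogisticRegression(max_iter=1000)
clf.fit(pca.transform(Xtrain), ytrain)
print('Test accuracy:', clf.score(pca.transform(Xtest), ytest))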

c. PCA and logistic regression for finding optimum parameters

PCA performs unsupervised dimensionality reduction, while logistic regression performs
the prediction. We use GridSearchCV to choose the dimensionality of the PCA.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import datasets


from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
# SGDClassifier with loss='log' fits a logistic-regression model by
# stochastic gradient descent.
# random_state seeds the shuffling of the training data.
logistic = SGDClassifier(loss='log', penalty='l2', early_stopping=True,
                         max_iter=10000, tol=1e-5, random_state=0)
pca = PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

#load digits, a dataset of handwritten digits


digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

6
FEM2063 DATA ANALYTICS - MAY 2020

# logspace returns numbers spaced evenly on a log scale.
param_grid = {'pca__n_components': [5, 10, 20, 30, 40, 50, 60],
              'logistic__alpha': np.logspace(-4, 4, 5)}

# GridSearchCV combines an estimator with a parameter grid to tune
# hyperparameters: it picks, by cross-validation, the parameter combination
# that gives the best score for the chosen estimator.
search = GridSearchCV(pipe, param_grid, cv=5)


search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# Plot the PCA spectrum


pca.fit(X_digits)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))


ax0.plot(pca.explained_variance_ratio_.cumsum(), linewidth=2)
ax0.set_ylabel('PCA explained variance')

ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results = pd.DataFrame(search.cv_results_)
components_col = 'param_pca__n_components'
best_clfs = results.groupby(components_col).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs.plot(x=components_col, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')

plt.tight_layout()
plt.show()


************The End***************
