Dimensionality Reduction: Motivation I: Data Compression

The document discusses dimensionality reduction and principal component analysis (PCA). It provides motivation for dimensionality reduction including data compression and data visualization. It then discusses the PCA problem formulation, which aims to find a low dimensional projection of the data that minimizes projection error. The PCA algorithm is described, including preprocessing data, computing the covariance matrix, and obtaining principal components from the singular value decomposition of the covariance matrix. Reconstruction from the compressed representation is also covered. Methods for choosing the number of principal components like retaining 99% of variance are presented.



Dimensionality Reduction: Motivation I: Data Compression

Data Compression
Reduce data from 2D to 1D. [Figure: the same quantity measured in inches and in centimetres; because the two features are redundant, the points lie close to a line, and each example can be represented by a single number z1 along that line.]


Data Compression
Reduce data from 3D to 2D. [Figure: 3D data points lying close to a plane are projected onto that plane and represented by two new coordinates z1 and z2.]


Dimensionality Reduction: Motivation II: Data Visualization


Data Visualization

| Country | GDP (trillions of US$) | Per capita GDP (thousands of intl. $) | Human Development Index | Life expectancy | Poverty Index (Gini as percentage) | Mean household income (thousands of US$) | … |
| Canada | 1.577 | 39.17 | 0.908 | 80.7 | 32.6 | 67.293 | … |
| China | 5.878 | 7.54 | 0.687 | 73 | 46.9 | 10.22 | … |
| India | 1.632 | 3.41 | 0.547 | 64.7 | 36.8 | 0.735 | … |
| Russia | 1.48 | 19.84 | 0.755 | 65.5 | 39.9 | 0.72 | … |
| Singapore | 0.223 | 56.69 | 0.866 | 80 | 42.5 | 67.1 | … |
| USA | 14.527 | 46.86 | 0.91 | 78.3 | 40.8 | 84.3 | … |
| … | … | … | … | … | … | … | … |

[resources from en.wikipedia.org]

Data Visualization

| Country | z1 | z2 |
| Canada | 1.6 | 1.2 |
| China | 1.7 | 0.3 |
| India | 1.6 | 0.2 |
| Russia | 1.4 | 0.5 |
| Singapore | 0.5 | 1.7 |
| USA | 2 | 1.5 |
| … | … | … |
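As an illustration of this reduction, here is a minimal scikit-learn sketch that takes the six indicator columns from the table above and compresses each country to two numbers for plotting. The exact values it produces will not match the z1, z2 table, since that table was derived from a much larger set of features.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

country_names = ['Canada', 'China', 'India', 'Russia', 'Singapore', 'USA']
country_features = np.array([   # columns: GDP, per-capita GDP, HDI, life expectancy, Gini, mean income
    [1.577, 39.17, 0.908, 80.7, 32.6, 67.293],
    [5.878, 7.54, 0.687, 73.0, 46.9, 10.22],
    [1.632, 3.41, 0.547, 64.7, 36.8, 0.735],
    [1.480, 19.84, 0.755, 65.5, 39.9, 0.72],
    [0.223, 56.69, 0.866, 80.0, 42.5, 67.1],
    [14.527, 46.86, 0.910, 78.3, 40.8, 84.3],
])

# Scale the features to comparable ranges, then project onto the top two principal components.
z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(country_features))
for name, (z1, z2) in zip(country_names, z):
    print(f'{name}: z1 = {z1:.2f}, z2 = {z2:.2f}')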


Data Visualization [Figure: the countries plotted in the reduced two-dimensional space (z1, z2).]


Dimensionality Reduction: Principal Component Analysis problem formulation
(PCA is the most commonly used algorithm for dimensionality reduction.)

Principal Component Analysis (PCA) problem formulation


(PCA finds a low-dimensional hyperplane, the projection hyperplane, onto which to project the data so that the projection error is minimized.)


Principal Component Analysis (PCA) problem formulation

Reduce from 2-dimensions to 1-dimension: find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
Reduce from n-dimensions to k-dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.
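Written out (a sketch using the notation from the later slides, where $x_{approx}^{(i)}$ denotes the projection of $x^{(i)}$ onto the subspace spanned by $u^{(1)}, \ldots, u^{(k)}$), the objective is to choose the vectors so as to minimize the average squared projection error:

$$\min_{u^{(1)}, \ldots, u^{(k)}} \; \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - x_{approx}^{(i)} \right\rVert^2$$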


PCA is not linear regression. (Linear regression minimizes the squared vertical distances between the predictions and the target values y; PCA has no target variable and minimizes the squared orthogonal projection distances onto the subspace.)
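As a rough side-by-side of the two objectives (a sketch, not the slide's exact notation): linear regression minimizes the squared vertical errors to a distinguished output variable $y$, while PCA treats all features symmetrically and minimizes the squared orthogonal projection error:

$$\text{Linear regression: } \min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 \qquad \text{PCA: } \min_{u^{(1)}, \ldots, u^{(k)}} \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - x_{approx}^{(i)} \right\rVert^2$$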


Dimensionality Reduction: Principal Component Analysis algorithm


Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):

Compute the mean of each feature, $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$, and replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.

If different features are on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale the features to have comparable ranges of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ is the standard deviation (or the range) of feature $j$.
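A minimal NumPy sketch of this preprocessing step (an illustration, assuming the training examples are stacked as the rows of an m x n array X):

import numpy as np

def mean_normalize(X):
    # Per-feature mean and scale; the scale could also be the range (max - min) of each feature.
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    # Return mu and s as well, so exactly the same mapping can be applied to new examples later.
    return (X - mu) / s, mu, s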


Principal Component Analysis (PCA) algorithm

Reduce data from 2D to 1D: compute a vector $u^{(1)}$ and, for each example, a new feature $z_1$. Reduce data from 3D to 2D: compute vectors $u^{(1)}, u^{(2)}$ and new features $z_1, z_2$.


(The capital Greek letter Sigma, $\Sigma$, denotes the covariance matrix here, not a summation.)

Principal Component Analysis (PCA) algorithm

Reduce data from $n$-dimensions to $k$-dimensions.
Compute the "covariance matrix": $\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$
Compute the "eigenvectors" of matrix $\Sigma$:

[U,S,V] = svd(Sigma);


Principal Component Analysis (PCA) algorithm


From [U,S,V] = svd(Sigma), we get the $n \times n$ matrix $U$ whose columns $u^{(1)}, u^{(2)}, \ldots, u^{(n)}$ are the principal components. Take its first $k$ columns to form $U_{reduce} \in \mathbb{R}^{n \times k}$, and map each example to $z = U_{reduce}^T\, x \in \mathbb{R}^k$.


Principal Component Analysis (PCA) algorithm summary

After mean normalization (ensure every feature has zero mean) and optionally feature scaling:

Sigma = (1/m) * X' * X;    % X is the m-by-n matrix whose rows are the training examples
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;
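The same summary expressed with NumPy (a sketch, assuming X is the mean-normalized m x n data matrix and k is the chosen number of components):

import numpy as np

m, n = X.shape                     # X: mean-normalized data, one example per row (assumed to exist)
Sigma = (X.T @ X) / m              # n x n covariance matrix
U, S, Vt = np.linalg.svd(Sigma)    # columns of U are the principal components
Ureduce = U[:, :k]                 # keep the first k columns
Z = X @ Ureduce                    # row i of Z is z(i) = Ureduce' * x(i)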


Dimensionality Reduction: Reconstruction from compressed representation


Reconstruction from compressed representation: given the compressed representation $z = U_{reduce}^T\, x$, the original example can be approximately reconstructed as $x_{approx} = U_{reduce}\, z \approx x$.
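Continuing the NumPy sketch from the algorithm summary (it assumes the X, Z, and Ureduce variables defined there), the reconstruction and the resulting average squared projection error look like this:

import numpy as np

X_approx = Z @ Ureduce.T                                       # x_approx(i) = Ureduce * z(i)
avg_proj_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))  # average squared projection error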


Dimensionality Reduction: Choosing the number of principal components

Choosing k (the number of principal components)

Average squared projection error: $\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2$
Typically, choose $k$ to be the smallest value so that

$\frac{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2}{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2} \le 0.01$ (1%)

"99% of variance is retained"


Choosing k (the number of principal components)

Algorithm: [U,S,V] = svd(Sigma)
Try PCA with $k = 1$ (then $k = 2, 3, \ldots$)
Compute $U_{reduce},\; z^{(1)}, \ldots, z^{(m)},\; x_{approx}^{(1)}, \ldots, x_{approx}^{(m)}$
Check if
$\frac{\frac{1}{m}\sum_{i} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2}{\frac{1}{m}\sum_{i} \lVert x^{(i)} \rVert^2} \le 0.01$
and stop at the smallest $k$ that passes. Rerunning PCA for every candidate $k$ is expensive; the singular values from the svd give a cheaper equivalent test (below).


Choosing k (the number of principal components)

[U,S,V] = svd(Sigma)
Pick the smallest value of $k$ for which

$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$

(99% of variance retained)
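A NumPy sketch of this check (assuming Sigma is the covariance matrix computed earlier; np.linalg.svd returns the diagonal of S as a 1-D array of singular values):

import numpy as np

U, S, Vt = np.linalg.svd(Sigma)
variance_retained = np.cumsum(S) / np.sum(S)              # fraction retained for k = 1, 2, ..., n
k = int(np.searchsorted(variance_retained, 0.99)) + 1     # smallest k with at least 99% retained
print(k, variance_retained[k - 1])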


Dimensionality Reduction: Advice for applying PCA


Supervised learning speedup

Training set: $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
Extract the inputs as an unlabeled dataset $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}^{n}$ and use PCA to map them to $z^{(1)}, \ldots, z^{(m)} \in \mathbb{R}^{k}$ with $k < n$.
New training set: $(z^{(1)}, y^{(1)}), \ldots, (z^{(m)}, y^{(m)})$

Note: the mapping $x^{(i)} \rightarrow z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross-validation and test sets (see the sketch below).
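A minimal scikit-learn sketch of that note; X_train, X_cv, X_test, and k are hypothetical placeholders for your own arrays and choice of dimension:

from sklearn.decomposition import PCA

pca = PCA(n_components=k)
Z_train = pca.fit_transform(X_train)   # the mapping is learned from the training set only
Z_cv = pca.transform(X_cv)             # the same mapping is then reused for the cross-validation set...
Z_test = pca.transform(X_test)         # ...and for the test set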

Application of PCA

- Compression
  - Reduce memory/disk needed to store data
  - Speed up learning algorithm
- Visualization


Bad use of PCA: to prevent overfitting

Use $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features from $n$ to $k < n$. Thus there are fewer features, so the model is less likely to overfit.

This might work OK, but it isn't a good way to address overfitting; use regularization instead. PCA discards information without looking at the labels $y^{(i)}$, so it can throw away variation that matters for prediction.


PCA is sometimes used where it shouldn't be

Design of an ML system:
- Get a training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
- Run PCA to reduce $x^{(i)}$ in dimension, obtaining $z^{(i)}$
- Train logistic regression on $(z^{(1)}, y^{(1)}), \ldots, (z^{(m)}, y^{(m)})$
- Test on the test set: map each $x_{test}^{(i)}$ to $z_{test}^{(i)}$ and run the classifier on $(z_{test}^{(1)}, y_{test}^{(1)}), \ldots$

How about doing the whole thing without using PCA?

Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn't do what you want should you implement PCA and consider using $z^{(i)}$.

PCA Code

import pandas as pd
from sklearn.decomposition import PCA

PCAcom = 2                                   # number of principal components to keep

pca = PCA()
pca.n_components = PCAcom
pca_data = pca.fit_transform(sample_data)    # sample_data: an existing (m x n) array of examples
pca_df = pd.DataFrame(data=pca_data)


PCA for Data Visualization

• For a lot of machine learning applications it helps to be able to visualize your data.
• Visualizing 2- or 3-dimensional data is not that challenging.
• However, even the Iris dataset used in this part of the tutorial is 4-dimensional.


Load Iris Dataset

• The Iris dataset is one of the datasets scikit-learn comes with that do not require downloading any file from an external website. The code below will load the Iris dataset.


Load Iris Dataset

import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length', 'sepal width', 'petal length', 'petal width', 'target'])


Standardize the Data

• PCA is affected by scale, so you need to scale the features in your data before applying PCA.
• Use StandardScaler to standardize the dataset's features onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.


Standardize the Data

from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = df.loc[:, features].values


Standardize the Data


# Separating out the target
y = df.loc[:,['target']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)



PCA Projection to 2D

• The original data has 4 columns (sepal length, sepal width, petal length, and petal width).
• In this section, the code projects the original data, which is 4-dimensional, into 2 dimensions.
• The new components are just the two main dimensions of variation.


PCA Projection to 2D

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])




PCA Projection to 2D

finalDf = pd.concat([principalDf, df[['target']]], axis = 1)

• Concatenating the two DataFrames along axis = 1. finalDf is the final DataFrame before plotting the data.


Visualize 2D Projection

• This section just plots the 2-dimensional data. Notice on the graph below that the classes seem well separated from each other.

import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)

Visualize 2D Projection

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c = color, s = 50)
ax.legend(targets)
ax.grid()

Visualize 2D Projection [Figure: scatter plot of the two principal components with the three Iris classes shown in different colors; the classes appear well separated.]


Explained Variance

• The explained variance tells you how much information (variance) can be attributed to each of the principal components.
• This is important as, while you can convert 4-dimensional space to 2-dimensional space, you lose some of the variance (information) when you do this.


Explained Variance

• By using the attribute explained_variance_ratio_, you can see that the first principal component contains 72.77% of the variance and the second principal component contains 23.03% of the variance.
• Together, the two components contain 95.80% of the information.

pca.explained_variance_ratio_
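For example, continuing the Iris code above (a sketch; the quoted percentages come from the slide and depend on the data):

ratios = pca.explained_variance_ratio_    # roughly [0.7277, 0.2303] for the standardized Iris data
print(ratios, ratios.sum())               # the two components together retain about 95.8% of the variance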


PCA to Speed-up Machine Learning Algorithms

• The MNIST database of handwritten digits is more suitable here, as it has 784 feature columns (784 dimensions), a training set of 60,000 examples, and a test set of 10,000 examples.


Download and Load the Data

• You can also pass a data_home parameter to fetch_openml to change where the data is downloaded.

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784')


Download and Load the Data

• The images that you downloaded are contained in mnist.data and have a shape of (70000, 784), meaning there are 70,000 images with 784 dimensions (784 features).


Download and Load the Data

• The labels (the integers 0–9) are contained in mnist.target.
• The features are 784-dimensional (28 x 28 images) and the labels are simply the numbers 0–9.


Split Data into Training and Test Sets

• Typically the train/test split is 80% training and 20% test.
• In this case, I chose 6/7th of the data to be training and 1/7th of the data to be in the test set.


Split Data into Training and Test Sets

from sklearn.model_selection import train_test_split
# test_size: what proportion of the original data is used for the test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)


Standardize the Data

• The text in this paragraph is almost an exact copy of what was written earlier.


Standardize the Data

• PCA is affected by scale, so you need to scale the features in the data before applying PCA.
• You can transform the data onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.


Standardize the Data

• StandardScaler helps standardize the dataset's features. Note that you fit on the training set and transform on both the training and test sets.


Standardize the Data

• Note that you fit on the training set and transform on both the training and test sets. If you want to see the negative effect that not scaling your data can have, scikit-learn has a section on the effects of not standardizing your data.


Standardize the Data

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)


Import and Apply PCA

• Notice the code below has .95 for the number of components parameter. It means that scikit-learn chooses the minimum number of principal components such that 95% of the variance is retained.


Import and Apply PCA


from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(.95)
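The natural next step, not shown on the slide, would be to fit this PCA instance on the scaled training images and apply the same mapping to both sets; a sketch under that assumption:

# Fit on the training set only, then map both sets into the reduced space.
pca.fit(train_img)
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)
print(pca.n_components_)    # the number of components scikit-learn kept to retain 95% of the variance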
