Dimensionality Reduction: Motivation I: Data Compression

The document discusses dimensionality reduction and principal component analysis (PCA). It provides motivation for dimensionality reduction including data compression and data visualization. It then discusses the PCA problem formulation, which aims to find a low dimensional projection of the data that minimizes projection error. The PCA algorithm is described, including preprocessing data, computing the covariance matrix, and obtaining principal components from the singular value decomposition of the covariance matrix. Reconstruction from the compressed representation is also covered. Methods for choosing the number of principal components like retaining 99% of variance are presented.



Dimensionality Reduction: Motivation I: Data Compression

Data Compression
Reduce data from 2D to 1D. [Figure: the same quantity measured in inches and in centimetres; because the two features are redundant, the points lie close to a line, and each example can be represented by a single number z1 along that line.]


Data Compression
Reduce data from 3D to 2D. [Figure: 3D data points lying close to a plane are projected onto that plane and represented by two new coordinates z1 and z2.]


Dimensionality Reduction: Motivation II: Data Visualization


Data Visualization

| Country | GDP (trillions of US$) | Per capita GDP (thousands of intl. $) | Human Development Index | Life expectancy | Poverty Index (Gini as percentage) | Mean household income (thousands of US$) | … |
| Canada | 1.577 | 39.17 | 0.908 | 80.7 | 32.6 | 67.293 | … |
| China | 5.878 | 7.54 | 0.687 | 73 | 46.9 | 10.22 | … |
| India | 1.632 | 3.41 | 0.547 | 64.7 | 36.8 | 0.735 | … |
| Russia | 1.48 | 19.84 | 0.755 | 65.5 | 39.9 | 0.72 | … |
| Singapore | 0.223 | 56.69 | 0.866 | 80 | 42.5 | 67.1 | … |
| USA | 14.527 | 46.86 | 0.91 | 78.3 | 40.8 | 84.3 | … |
| … | … | … | … | … | … | … | … |

[resources from en.wikipedia.org]

Data Visualization

| Country | z1 | z2 |
| Canada | 1.6 | 1.2 |
| China | 1.7 | 0.3 |
| India | 1.6 | 0.2 |
| Russia | 1.4 | 0.5 |
| Singapore | 0.5 | 1.7 |
| USA | 2 | 1.5 |
| … | … | … |
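As an illustration of this reduction, here is a minimal scikit-learn sketch that takes the six indicator columns from the table above and compresses each country to two numbers for plotting. The exact values it produces will not match the z1, z2 table, since that table was derived from a much larger set of features.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

country_names = ['Canada', 'China', 'India', 'Russia', 'Singapore', 'USA']
country_features = np.array([   # columns: GDP, per-capita GDP, HDI, life expectancy, Gini, mean income
    [1.577, 39.17, 0.908, 80.7, 32.6, 67.293],
    [5.878, 7.54, 0.687, 73.0, 46.9, 10.22],
    [1.632, 3.41, 0.547, 64.7, 36.8, 0.735],
    [1.480, 19.84, 0.755, 65.5, 39.9, 0.72],
    [0.223, 56.69, 0.866, 80.0, 42.5, 67.1],
    [14.527, 46.86, 0.910, 78.3, 40.8, 84.3],
])

# Scale the features to comparable ranges, then project onto the top two principal components.
z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(country_features))
for name, (z1, z2) in zip(country_names, z):
    print(f'{name}: z1 = {z1:.2f}, z2 = {z2:.2f}')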


Data Visualization [Figure: the countries plotted in the reduced two-dimensional space (z1, z2).]


Dimensionality Reduction: Principal Component Analysis problem formulation
(PCA is the most commonly used algorithm for dimensionality reduction.)

Principal Component Analysis (PCA) problem formulation


(PCA finds a low-dimensional hyperplane, the projection hyperplane, onto which to project the data so that the projection error is minimized.)


Principal Component Analysis (PCA) problem formulation

Reduce from 2-dimensions to 1-dimension: find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
Reduce from n-dimensions to k-dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.
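Written out (a sketch using the notation from the later slides, where $x_{approx}^{(i)}$ denotes the projection of $x^{(i)}$ onto the subspace spanned by $u^{(1)}, \ldots, u^{(k)}$), the objective is to choose the vectors so as to minimize the average squared projection error:

$$\min_{u^{(1)}, \ldots, u^{(k)}} \; \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - x_{approx}^{(i)} \right\rVert^2$$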


PCA is not linear regression. (Linear regression minimizes the squared vertical distances between the predictions and the target values y; PCA has no target variable and minimizes the squared orthogonal projection distances onto the subspace.)
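As a rough side-by-side of the two objectives (a sketch, not the slide's exact notation): linear regression minimizes the squared vertical errors to a distinguished output variable $y$, while PCA treats all features symmetrically and minimizes the squared orthogonal projection error:

$$\text{Linear regression: } \min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 \qquad \text{PCA: } \min_{u^{(1)}, \ldots, u^{(k)}} \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - x_{approx}^{(i)} \right\rVert^2$$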


Dimensionality Reduction: Principal Component Analysis algorithm


Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):

Compute the mean of each feature, $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$, and replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.

If different features are on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale the features to have comparable ranges of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ is the standard deviation (or the range) of feature $j$.
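A minimal NumPy sketch of this preprocessing step (an illustration, assuming the training examples are stacked as the rows of an m x n array X):

import numpy as np

def mean_normalize(X):
    # Per-feature mean and scale; the scale could also be the range (max - min) of each feature.
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    # Return mu and s as well, so exactly the same mapping can be applied to new examples later.
    return (X - mu) / s, mu, s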


Principal Component Analysis (PCA) algorithm

Reduce data from 2D to 1D: compute a vector $u^{(1)}$ and, for each example, a new feature $z_1$. Reduce data from 3D to 2D: compute vectors $u^{(1)}, u^{(2)}$ and new features $z_1, z_2$.


(The capital Greek letter Sigma, $\Sigma$, denotes the covariance matrix here, not a summation.)

Principal Component Analysis (PCA) algorithm

Reduce data from $n$-dimensions to $k$-dimensions.
Compute the "covariance matrix": $\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$
Compute the "eigenvectors" of matrix $\Sigma$:

[U,S,V] = svd(Sigma);


Principal Component Analysis (PCA) algorithm


From [U,S,V] = svd(Sigma), we get the $n \times n$ matrix $U$ whose columns $u^{(1)}, u^{(2)}, \ldots, u^{(n)}$ are the principal components. Take its first $k$ columns to form $U_{reduce} \in \mathbb{R}^{n \times k}$, and map each example to $z = U_{reduce}^T\, x \in \mathbb{R}^k$.


Principal Component Analysis (PCA) algorithm summary

After mean normalization (ensure every feature has zero mean) and optionally feature scaling:

Sigma = (1/m) * X' * X;    % X is the m-by-n matrix whose rows are the training examples
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;
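The same summary expressed with NumPy (a sketch, assuming X is the mean-normalized m x n data matrix and k is the chosen number of components):

import numpy as np

m, n = X.shape                     # X: mean-normalized data, one example per row (assumed to exist)
Sigma = (X.T @ X) / m              # n x n covariance matrix
U, S, Vt = np.linalg.svd(Sigma)    # columns of U are the principal components
Ureduce = U[:, :k]                 # keep the first k columns
Z = X @ Ureduce                    # row i of Z is z(i) = Ureduce' * x(i)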


Dimensionality Reduction: Reconstruction from compressed representation


Reconstruction from compressed representation: given the compressed representation $z = U_{reduce}^T\, x$, the original example can be approximately reconstructed as $x_{approx} = U_{reduce}\, z \approx x$.
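Continuing the NumPy sketch from the algorithm summary (it assumes the X, Z, and Ureduce variables defined there), the reconstruction and the resulting average squared projection error look like this:

import numpy as np

X_approx = Z @ Ureduce.T                                       # x_approx(i) = Ureduce * z(i)
avg_proj_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))  # average squared projection error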


Dimensionality Reduction: Choosing the number of principal components

Choosing k (the number of principal components)

Average squared projection error: $\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2$
Typically, choose $k$ to be the smallest value so that

$\frac{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2}{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2} \le 0.01$ (1%)

"99% of variance is retained"


Choosing k (the number of principal components)

Algorithm: [U,S,V] = svd(Sigma)
Try PCA with $k = 1$ (then $k = 2, 3, \ldots$)
Compute $U_{reduce},\; z^{(1)}, \ldots, z^{(m)},\; x_{approx}^{(1)}, \ldots, x_{approx}^{(m)}$
Check if
$\frac{\frac{1}{m}\sum_{i} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2}{\frac{1}{m}\sum_{i} \lVert x^{(i)} \rVert^2} \le 0.01$
and stop at the smallest $k$ that passes. Rerunning PCA for every candidate $k$ is expensive; the singular values from the svd give a cheaper equivalent test (below).


Choosing k (the number of principal components)

[U,S,V] = svd(Sigma)
Pick the smallest value of $k$ for which

$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$

(99% of variance retained)
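A NumPy sketch of this check (assuming Sigma is the covariance matrix computed earlier; np.linalg.svd returns the diagonal of S as a 1-D array of singular values):

import numpy as np

U, S, Vt = np.linalg.svd(Sigma)
variance_retained = np.cumsum(S) / np.sum(S)              # fraction retained for k = 1, 2, ..., n
k = int(np.searchsorted(variance_retained, 0.99)) + 1     # smallest k with at least 99% retained
print(k, variance_retained[k - 1])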


Dimensionality Reduction: Advice for applying PCA


Supervised learning speedup

Training set: $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
Extract the inputs as an unlabeled dataset $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}^{n}$ and use PCA to map them to $z^{(1)}, \ldots, z^{(m)} \in \mathbb{R}^{k}$ with $k < n$.
New training set: $(z^{(1)}, y^{(1)}), \ldots, (z^{(m)}, y^{(m)})$

Note: the mapping $x^{(i)} \rightarrow z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross-validation and test sets (see the sketch below).
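A minimal scikit-learn sketch of that note; X_train, X_cv, X_test, and k are hypothetical placeholders for your own arrays and choice of dimension:

from sklearn.decomposition import PCA

pca = PCA(n_components=k)
Z_train = pca.fit_transform(X_train)   # the mapping is learned from the training set only
Z_cv = pca.transform(X_cv)             # the same mapping is then reused for the cross-validation set...
Z_test = pca.transform(X_test)         # ...and for the test set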

Application of PCA

- Compression
  - Reduce memory/disk needed to store data
  - Speed up learning algorithm
- Visualization


Bad use of PCA: to prevent overfitting

Use $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features from $n$ to $k < n$. Thus there are fewer features, so the model is less likely to overfit.

This might work OK, but it isn't a good way to address overfitting; use regularization instead. PCA discards information without looking at the labels $y^{(i)}$, so it can throw away variation that matters for prediction.


PCA is sometimes used where it shouldn't be

Design of an ML system:
- Get a training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
- Run PCA to reduce $x^{(i)}$ in dimension, obtaining $z^{(i)}$
- Train logistic regression on $(z^{(1)}, y^{(1)}), \ldots, (z^{(m)}, y^{(m)})$
- Test on the test set: map each $x_{test}^{(i)}$ to $z_{test}^{(i)}$ and run the classifier on $(z_{test}^{(1)}, y_{test}^{(1)}), \ldots$

How about doing the whole thing without using PCA?

Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn't do what you want should you implement PCA and consider using $z^{(i)}$.

PCA Code

import pandas as pd
from sklearn.decomposition import PCA

PCAcom = 2                                   # number of principal components to keep

pca = PCA()
pca.n_components = PCAcom
pca_data = pca.fit_transform(sample_data)    # sample_data: an existing (m x n) array of examples
pca_df = pd.DataFrame(data=pca_data)


PCA for Data Visualization

• For a lot of machine learning applications it helps to be able to visualize your data.
• Visualizing 2- or 3-dimensional data is not that challenging.
• However, even the Iris dataset used in this part of the tutorial is 4-dimensional.


Load Iris Dataset

• The Iris dataset is one of the datasets scikit-learn comes with that do not require downloading any file from an external website. The code below will load the Iris dataset.


Load Iris Dataset

import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length', 'sepal width', 'petal length', 'petal width', 'target'])


Standardize the Data

• PCA is affected by scale, so you need to scale the features in your data before applying PCA.
• Use StandardScaler to standardize the dataset's features onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.


Standardize the Data

from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = df.loc[:, features].values


Standardize the Data


# Separating out the target
y = df.loc[:,['target']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)



PCA Projection to 2D

• The original data has 4 columns (sepal length, sepal width, petal length, and petal width).
• In this section, the code projects the original data, which is 4-dimensional, into 2 dimensions.
• The new components are just the two main dimensions of variation.


PCA Projection to 2D

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])




PCA Projection to 2D

finalDf = pd.concat([principalDf, df[['target']]], axis = 1)

• Concatenating the two DataFrames along axis = 1. finalDf is the final DataFrame before plotting the data.


Visualize 2D Projection

• This section just plots the 2-dimensional data. Notice on the graph below that the classes seem well separated from each other.

import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)

Visualize 2D Projection

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c = color, s = 50)
ax.legend(targets)
ax.grid()

Visualize 2D Projection [Figure: scatter plot of the two principal components with the three Iris classes shown in different colors; the classes appear well separated.]


Explained Variance

• The explained variance tells you how much information (variance) can be attributed to each of the principal components.
• This is important as, while you can convert 4-dimensional space to 2-dimensional space, you lose some of the variance (information) when you do this.


Explained Variance

• By using the attribute explained_variance_ratio_, you can see that the first principal component contains 72.77% of the variance and the second principal component contains 23.03% of the variance.
• Together, the two components contain 95.80% of the information.

pca.explained_variance_ratio_
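For example, continuing the Iris code above (a sketch; the quoted percentages come from the slide and depend on the data):

ratios = pca.explained_variance_ratio_    # roughly [0.7277, 0.2303] for the standardized Iris data
print(ratios, ratios.sum())               # the two components together retain about 95.8% of the variance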


PCA to Speed-up Machine Learning Algorithms

• The MNIST database of handwritten digits is more suitable here, as it has 784 feature columns (784 dimensions), a training set of 60,000 examples, and a test set of 10,000 examples.


Download and Load the Data

• You can also pass a data_home parameter to fetch_openml to change where the data is downloaded.

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784')


Download and Load the Data

• The images that you downloaded are contained in mnist.data and have a shape of (70000, 784), meaning there are 70,000 images with 784 dimensions (784 features).


Download and Load the Data

• The labels (the integers 0–9) are contained in mnist.target.
• The features are 784-dimensional (28 x 28 images) and the labels are simply the numbers 0–9.


Split Data into Training and Test Sets

• Typically the train/test split is 80% training and 20% test.
• In this case, I chose 6/7th of the data to be training and 1/7th of the data to be in the test set.


Split Data into Training and Test Sets

from sklearn.model_selection import train_test_split
# test_size: what proportion of the original data is used for the test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)


Standardize the Data

• The text in this paragraph is almost an exact copy of what was written earlier.


Standardize the Data

• PCA is affected by scale, so you need to scale the features in the data before applying PCA.
• You can transform the data onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.


Standardize the Data

• StandardScaler helps standardize the dataset's features. Note that you fit on the training set and transform on both the training and test sets.


Standardize the Data

• Note that you fit on the training set and transform on both the training and test sets. If you want to see the negative effect that not scaling your data can have, scikit-learn has a section on the effects of not standardizing your data.


Standardize the Data

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)


Import and Apply PCA

• Notice the code below has .95 for the number of components parameter. It means that scikit-learn chooses the minimum number of principal components such that 95% of the variance is retained.


Import and Apply PCA


from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(.95)
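The natural next step, not shown on the slide, would be to fit this PCA instance on the scaled training images and apply the same mapping to both sets; a sketch under that assumption:

# Fit on the training set only, then map both sets into the reduced space.
pca.fit(train_img)
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)
print(pca.n_components_)    # the number of components scikit-learn kept to retain 95% of the variance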
