Dimensionality Reduction
Motivation I: Data Compression
Machine Learning
Data Compression
Reduce data from 2D to 1D
[Figure: two highly redundant features, x1 (inches) and x2 (cm), projected onto a line so that each example is summarized by a single number z1]
Data Compression
Reduce data from 3D to 2D
[Figure: 3D data lying close to a plane, projected onto that plane to give two features z1 and z2]
Dimensionality Reduction
Motivation II: Data Visualization
Machine Learning
Data Visualization

Country   | GDP (trillions of US$) | Per capita GDP (thousands of intl. $) | Human Development Index | Life expectancy | Poverty Index (Gini as percentage) | Mean household income (thousands of US$) | …
Canada    | 1.577  | 39.17 | 0.908 | 80.7 | 32.6 | 67.293 | …
China     | 5.878  |  7.54 | 0.687 | 73   | 46.9 | 10.22  | …
India     | 1.632  |  3.41 | 0.547 | 64.7 | 36.8 |  0.735 | …
Russia    | 1.48   | 19.84 | 0.755 | 65.5 | 39.9 |  0.72  | …
Singapore | 0.223  | 56.69 | 0.866 | 80   | 42.5 | 67.1   | …
USA       | 14.527 | 46.86 | 0.91  | 78.3 | 40.8 | 84.3   | …
…         | …      | …     | …     | …    | …    | …      | …

[resources from en.wikipedia.org]
Data Visualization

Reducing the high-dimensional country data to two features z1 and z2:

Country   | z1  | z2
Canada    | 1.6 | 1.2
China     | 1.7 | 0.3
India     | 1.6 | 0.2
Russia    | 1.4 | 0.5
Singapore | 0.5 | 1.7
USA       | 2   | 1.5
…         | …   | …
Data Visualization
[Figure: countries plotted in the reduced 2D space, z1 vs. z2]
Dimensionality Reduction
Principal Component Analysis: Problem Formulation
Machine Learning
(PCA is the most widely used algorithm for dimensionality reduction)
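Restating the standard formulation: to reduce from n dimensions to k dimensions, PCA finds k directions u^{(1)}, ..., u^{(k)} onto which to project the data so as to minimize the average squared projection error:

\min_{u^{(1)},\dots,u^{(k)}} \; \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{\mathrm{approx}}^{(i)} \right\|^2

where x_approx^{(i)} is the projection of x^{(i)} onto the subspace spanned by u^{(1)}, ..., u^{(k)}.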
PCA is not linear regression: linear regression minimizes the squared vertical errors between the line and the values of y it predicts, whereas PCA minimizes the squared orthogonal projection error onto the subspace. PCA treats all features alike and has no distinguished output variable y.
Dimensionality Reduction
Principal Component Analysis Algorithm
Machine Learning
Data preprocessing

Training set: x^{(1)}, x^{(2)}, \dots, x^{(m)}

Preprocessing (feature scaling / mean normalization): compute the mean of each feature,
\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)},
and replace each x_j^{(i)} with x_j^{(i)} - \mu_j. If features are on different scales, also divide by a scale factor s_j (e.g., the standard deviation or the range of feature j), so that x_j^{(i)} \leftarrow (x_j^{(i)} - \mu_j) / s_j.
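A minimal NumPy sketch of this preprocessing step (the small matrix X and the choice of the standard deviation for s_j are illustrative assumptions):

import numpy as np

# X: (m, n) data matrix, one training example per row (assumed layout)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2]])

mu = X.mean(axis=0)        # mu_j: per-feature mean
s = X.std(axis=0)          # s_j: here the standard deviation (the range would also work)
X_norm = (X - mu) / s      # mean-normalized, scaled features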
After preprocessing, compute the covariance matrix Sigma = (1/m) * X' * X and its eigenvectors (in Octave/MATLAB):

[U,S,V] = svd(Sigma);    % columns of U are the principal directions
Ureduce = U(:,1:k);      % keep the first k columns
z = Ureduce' * x;        % project x in R^n down to z in R^k
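A rough NumPy equivalent of the same steps (X_norm is the preprocessed matrix from the sketch above; k = 1 is an arbitrary illustrative choice):

import numpy as np

k = 1                                  # target dimensionality (assumed)
m = X_norm.shape[0]
Sigma = (X_norm.T @ X_norm) / m        # covariance matrix, shape (n, n)
U, S, Vt = np.linalg.svd(Sigma)        # columns of U are the principal directions
Ureduce = U[:, :k]                     # first k eigenvectors
Z = X_norm @ Ureduce                   # shape (m, k): each row is a compressed example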
Dimensionality Reduction
Reconstruction from Compressed Representation
Machine Learning
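The reconstruction step maps a compressed point z in R^k back to an approximation x_approx = Ureduce * z in the original R^n. Continuing the NumPy names assumed above:

X_approx = Z @ Ureduce.T   # shape (m, n): approximate reconstruction of X_norm
# Each reconstructed row lies in the k-dimensional subspace spanned by the
# columns of Ureduce; the approximation is exact only when k = n.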
Dimensionality Reduction
Choosing the Number of Principal Components
Machine Learning

Choose k to be the smallest value such that the average squared projection error is at most 1% of the total variation in the data:

\frac{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{\mathrm{approx}}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2} \le 0.01 \quad (1\%)

In other words, "99% of variance is retained."
Equivalently, using the diagonal matrix S returned by [U,S,V] = svd(Sigma), check if

\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99

and pick the smallest k for which this holds.
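A small sketch of this selection rule, assuming S is the 1-D array of singular values of Sigma returned by np.linalg.svd (as in the sketch above):

import numpy as np

variance_retained = np.cumsum(S) / np.sum(S)          # fraction retained for each k
k = int(np.argmax(variance_retained >= 0.99)) + 1     # smallest k retaining >= 99%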
Dimensionality Reduction
Advice for Applying PCA
Machine Learning
Supervised learning speedup

Extract inputs: from the labeled training set (x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}), take just the inputs to form an unlabeled dataset x^{(1)}, \dots, x^{(m)}; run PCA to obtain z^{(1)}, \dots, z^{(m)}, then train on the lower-dimensional pairs (z^{(i)}, y^{(i)}). The mapping from x to z should be defined by running PCA on the training set only and then applied to cross-validation and test examples; a pipeline sketch follows the list below.

Application of PCA
- Compression
  - Reduce the memory/disk needed to store data
  - Speed up a learning algorithm
- Visualization
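A minimal scikit-learn sketch of this speedup workflow, fitting the x-to-z mapping on the training inputs only; the train/test names, the 100-component choice, and the logistic-regression classifier are all illustrative assumptions:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=100)            # assumed target dimensionality
Z_train = pca.fit_transform(X_train)   # define the x -> z mapping on training inputs only
Z_test = pca.transform(X_test)         # apply the same mapping to held-out examples

clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
accuracy = clf.score(Z_test, y_test)   # evaluate in the compressed feature space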
PCA Code

from sklearn.decomposition import PCA   # note: the module name is lowercase
import pandas as pd

pca = PCA(n_components=2)                    # keep the 2 leading components
pca_data = pca.fit_transform(sample_data)    # fit PCA and project sample_data
pca_df = pd.DataFrame(data=pca_data)
PCA Projection to 2D
• The original data has 4 columns (sepal length, sepal width, petal length, and petal width).
• The code in this section projects the original 4-dimensional data into 2 dimensions.
• The new components are just the two main directions of variation.
PCA Projection to 2D

from sklearn.decomposition import PCA
import pandas as pd

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)   # x: the standardized iris features
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1',
                                    'principal component 2'])
PCA Projection to 2D

# df holds the original iris data; append its 'target' labels to the projected points
finalDf = pd.concat([principalDf, df[['target']]], axis=1)
Visualize 2D Projection
• This section just plots the 2-dimensional data. Notice in the plot below that the classes appear well separated from each other.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
Visualize 2D Projection

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
Visualize 2D Projection
[Figure: scatter plot of the three iris classes in the 2-component PCA space]
Explained Variance
• The explained variance tells you how much information (variance) can be attributed to each of the principal components.
• This matters because in projecting 4-dimensional data down to 2 dimensions, you lose some of the variance (information).
Explained Variance
• Using the attribute explained_variance_ratio_, you can see that the first principal component contains 72.77% of the variance and the second principal component contains 23.03%; together the two components retain 95.80% of the information.
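A quick way to inspect these numbers, using the pca object fitted in the projection step above:

print(pca.explained_variance_ratio_)   # e.g. array([0.7277, 0.2303]) for the iris data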