PCA_dev
By
Youssef Laasry
Mohammed Amine Tannaoui
Nada Mourhrib
Chapter 1
Introduction
In the realm of data analysis and machine learning, the challenge of navigating high-dimensional datasets is a ubiquitous and formidable one. Whether it be in fields such as image processing, genetics, or finance, the curse of dimensionality poses significant obstacles to comprehension, visualization, and modeling. This is where Principal Component Analysis (PCA) steps in as a powerful technique, offering a systematic approach to reduce dimensionality while preserving essential information.
• Correlation: It measures how strongly two variables are related to each other, i.e., how one variable changes when the other does. The correlation value ranges from -1 to +1: -1 indicates that the variables are perfectly negatively related (one increases as the other decreases), and +1 indicates that they are perfectly positively related.
• Orthogonal: It means that variables are not correlated with each other, and hence the correlation between such a pair of variables is zero.
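To make these two terms concrete, here is a minimal sketch (using NumPy, with made-up example values) that computes the correlation for a directly proportional, an inversely related, and an uncorrelated pair of variables:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.corrcoef(x, 2 * x)[0, 1])    # +1.0: directly proportional
print(np.corrcoef(x, -2 * x)[0, 1])   # -1.0: inversely related
y = np.array([1.0, 2.0, 5.0, 2.0, 1.0])
print(np.corrcoef(x, y)[0, 1])        #  0.0: uncorrelated (orthogonal)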
Chapter 2
Principal Components
As described above, the transformed new features, i.e., the output of PCA, are the Principal Components. The number of these PCs is less than or equal to the number of original features present in the dataset. Some properties of these principal components are given below:
• Each principal component must be a linear combination of the original features.
• These components are orthogonal, i.e., the correlation between any pair of them is zero.
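Both properties are easy to verify numerically. The sketch below (on a made-up dataset; all names are illustrative) checks that the eigenvectors of a covariance matrix are orthogonal and that the components built from them are uncorrelated:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # made-up dataset: 100 samples, 3 features
X = X - X.mean(axis=0)                         # center each feature
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
components = X @ eigenvectors                  # each PC is a linear combination of features
print(np.round(eigenvectors.T @ eigenvectors, 6))          # identity matrix: orthogonal
print(np.round(np.corrcoef(components, rowvar=False), 6))  # identity matrix: uncorrelated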
Chapter 3
Steps for PCA Algorithm
• Getting the dataset: Firstly, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
• Representing data into a structure: Next, we represent our dataset as a structure, namely a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item and each column corresponds to a feature; the number of columns gives the dimensionality of the dataset.
• Standardizing the data: The most common way of standardizing the data is the z-score, computed for each value as

Z_i = (x_i − µ) / σ

where µ and σ are the mean and standard deviation of the corresponding feature. After the standardization, the data of all the features will be transformed to the same scale.
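In NumPy this standardization is a single column-wise expression; a minimal sketch (the example values are made up):

import numpy as np

X = np.array([[170.0, 65.0],
              [180.0, 85.0],
              [160.0, 55.0]])             # e.g. heights (cm) and weights (kg)
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
print(Z.mean(axis=0), Z.std(axis=0))      # columns now have (near-)zero mean and unit std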
• The covariance matrix: The covariance matrix is a square matrix that holds the variance of each variable on its diagonal and the covariance between each pair of variables off the diagonal. It measures how the variables are associated with one another; in other words, it provides an empirical description of the data and shows how the features are correlated.
• Calculating the Eigenvalues and Eigenvectors: Now we need to calculate the eigenvalues and eigenvectors of the covariance matrix of the standardized data Z. The eigenvectors give the directions of the spread of the data, i.e., the directions of highest variance; they are also called right vectors, as they are column vectors. The eigenvalues give the relative importance of these directions. Finding the eigenvectors and eigenvalues of the covariance matrix is equivalent to fitting straight principal-component lines to the variance of the data. An eigenvector v and its eigenvalue λ are defined by the equation below, where A can be any square matrix:

(A − λI)v = 0
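These two steps translate directly into NumPy. The following sketch (with made-up standardized data Z) builds the covariance matrix and eigen-decomposes it; np.linalg.eigh is used because a covariance matrix is always symmetric:

import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 4))                     # stand-in for standardized data
covariance = np.cov(Z, rowvar=False)             # 4 x 4 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]            # rank directions by importance
print(eigenvalues[order])                        # largest eigenvalue first
print(eigenvectors[:, order[0]])                 # direction of greatest spread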
Figure 3.2: An example of eigenvectors and eigenvalues of the dataset with two variables
A and B
The eigenvalues and eigenvectors computed for the dataset are shown in the above table. Eigenvectors are our principal components, and eigenvalues give the relative importance of those components. Each eigenvector is perpendicular (orthogonal) to the ones calculated before it, which is why we can say that each of the principal components is uncorrelated with, or independent of, the others.
In this example, we have 4 features, so there will be four principal components. After sorting, we select 2 out of the 4, as they have the highest eigenvalues and thus account for the largest spread of the data; the feature vector is therefore reduced to two columns, as shown in the above table. This means that two principal components can describe most of the data.
• Project the data onto the selected principal components for dimensionality reduction: Now that we have selected 2 principal components out of 4 (for example), we can reduce the dimensionality of the data by applying the following formula, as sketched below:

Final Data Set = Standardized Original Data Set × Feature Vector
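In matrix form this projection is a single multiplication. A minimal sketch continuing the 4-feature example above (all data here is made up):

import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(50, 4))                 # standardized data with 4 features
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1][:2]    # keep the 2 largest eigenvalues
feature_vector = eigenvectors[:, order]      # 4 x 2 matrix of top eigenvectors
final_data = Z @ feature_vector              # (50, 4) @ (4, 2) -> (50, 2)
print(final_data.shape)                      # dimensionality reduced from 4 to 2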
• Remove less important features from the new dataset: A new feature set has now been obtained, so we decide what to keep and what to remove: only the relevant or important features are kept in the new dataset, and the unimportant features are removed.

Figure 3.3: Visual representation of principal components one and two fitted to a dataset
Chapter 4
Applications of Principal
Component Analysis
• In healthcare data, to explore the factors that are assumed to be important in increasing the risk of chronic disease.
• You can also use Principal Component Analysis to analyze patterns when dealing with high-dimensional data sets.
4.3 Disadvantages of Principal Component Analysis:
• You may face some difficulties in calculating the covariances and covariance matrices.
Chapter 5
Implementation of the PCA
import numpy as np

# NOTE: minimal sketch of the PCA routine, reconstructed to follow the
# steps described in Chapter 3.
def pca(data, num_components):
    # Standardize each feature to zero mean and unit variance
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Covariance matrix of the standardized data
    covariance = np.cov(standardized, rowvar=False)
    # Eigen-decomposition of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # Sort directions by decreasing eigenvalue and keep the top ones
    order = np.argsort(eigenvalues)[::-1][:num_components]
    feature_vector = eigenvectors[:, order]
    # Project the standardized data onto the selected components
    pca_result = standardized @ feature_vector
    return pca_result
# Example input (made-up data): 10 samples with 4 features
np.random.seed(0)
data = np.random.rand(10, 4)
num_components = 2

# Perform PCA
result = pca(data, num_components)
print("Original data:")
print(data)
print("\nPCA result:")
print(result)
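With the made-up input above, result has shape (10, 2): each of the 10 samples is now described by its coordinates along the two leading principal components.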
Conclusion
In conclusion, this assignment has covered the essentials of Principal Component Analysis (PCA) and featured a practical implementation from scratch. We started by exploring the theoretical foundations, emphasizing the significance of eigenvectors, eigenvalues, and covariance matrices in PCA.
The step-by-step implementation clarified the algorithmic processes involved in dimensionality reduction. By computing principal components and showcasing their application, we gained a deeper understanding of PCA's role in extracting meaningful features from datasets.
Through experiments and visualizations, we observed PCA's efficacy in retaining the essential variance of the data while reducing dimensionality. The results underscored its usefulness in various domains where interpreting and visualizing complex datasets are paramount.
In summary, this assignment aimed to provide a clear and practical insight into PCA, fostering a foundational understanding of its principles and applications in data analysis.