
Principal component analysis

Data mining project

By

Youssef Laasry
Mohammed Amine Tannaoui
Nada Mourhrib

Supervisor: Pr. Hosni

Wednesday 13th March, 2024


Contents

1 Introduction

2 Principal Components in PCA

3 Steps of the PCA algorithm

4 Applications of Principal Component Analysis
  4.1 Applications of PCA
  4.2 Advantages of Principal Component Analysis
  4.3 Disadvantages of Principal Component Analysis

5 Implementation of PCA
Chapter 1

Introduction
In the realm of data analysis and machine learning, the challenge of navigating high-dimensional datasets is a ubiquitous and formidable one. Whether it be in fields such as image processing, genetics, or finance, the curse of dimensionality poses significant obstacles to comprehension, visualization, and modeling. This is where Principal Component Analysis (PCA) steps in as a powerful technique, offering a systematic approach to reduce dimensionality while preserving essential information.

’Step by step, Principal Component Analysis unveils the hidden layers of data complexity, simplifying the intricate to reveal the essential.’

Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical procedure that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components. PCA is one of the popular tools used for exploratory data analysis and predictive modeling, and it draws out the strong patterns in a dataset by projecting it onto the directions of greatest variance.

PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional data. It works by considering the variance of each attribute, because attributes with high variance carry most of the structure that distinguishes the observations, and it reduces the dimensionality by keeping only the most informative directions.


Figure 1.1: The figure shows the reduction of features from 3 to 1.

Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. PCA is a feature extraction technique, so it retains the important variables and drops the least important ones.
The PCA algorithm is based on some mathematical concepts such as:

• Variance and Covariance

• Eigenvalues and Eigenvectors

Some common terms used in the PCA algorithm:

• Dimensionality: It is the number of features or variables present in the given dataset; more simply, it is the number of columns present in the dataset.

• Correlation: It signifies how strongly two variables are related to each other, i.e., how one changes when the other changes. The correlation value ranges from -1 to +1: -1 means the variables are inversely related, and +1 means they are directly related.

• Orthogonal: It means that the variables are not correlated with each other, so the correlation between each pair of variables is zero.

• Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v, i.e., Mv = λv for some scalar λ (the eigenvalue). A short numerical check of this definition is given after this list.

• Covariance Matrix: A matrix containing the covariances between each pair of variables is called the covariance matrix.
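To make the eigenvector definition concrete, here is a minimal NumPy sketch; the 2x2 matrix is made up purely for illustration, and the check simply confirms that Mv = λv holds for each eigenpair:

import numpy as np

# A small symmetric matrix, chosen arbitrarily for illustration.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eig(M)

# For each eigenpair, M v equals lambda * v (up to floating-point error).
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(M @ v, lam * v))    # prints True twice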

Chapter 2

Principal Components in PCA

As described above, the transformed new features, i.e., the output of PCA, are the principal components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

• Each principal component is a linear combination of the original features.

• The components are orthogonal, i.e., the correlation between any pair of components is zero.

• The importance of each component decreases when going from 1 to n: the first principal component has the most importance (it captures the most variance), and the n-th principal component has the least. The first two properties are illustrated in the short sketch below.
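As a small illustration of the first two properties, the following sketch uses synthetic, made-up data and NumPy: it builds the principal components as linear combinations of the original features and checks that the resulting components are mutually uncorrelated.

import numpy as np

# Synthetic data purely for illustration: 100 samples, 3 features,
# with the third feature deliberately correlated with the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix define the components.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
eigenvectors = eigenvectors[:, ::-1]          # most important first

# Each principal component is a linear combination of the original features.
scores = X_centered @ eigenvectors

# The components are uncorrelated: off-diagonal correlations are ~0.
print(np.round(np.corrcoef(scores, rowvar=False), 6))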

Chapter 3

Steps of the PCA algorithm

• Getting the dataset : Firstly, we need to take the input dataset and divide it
into two subparts X and Y, where X is the training set, and Y is the validation
set.

• Representing data in a structure : Now we represent our dataset in a structure, namely a two-dimensional matrix of the independent variable X. Each row corresponds to a data item and each column corresponds to a feature; the number of columns gives the dimensionality of the dataset.

• Standardization : Next, we standardize the data to zero mean and unit standard deviation. This is because PCA is very sensitive to the variances of the features or variables: large differences between the ranges of features would otherwise dominate over those with small ranges.

Zi = (xi − µ) / σ

The most common way of standardizing the data is the z-score given above. After standardization, all features are on the same scale. A minimal sketch of this step is given below.
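As a small, hedged illustration of this step (the values are made up), note that the implementation in Chapter 5 only mean-centers the data, whereas the full z-score also divides by the standard deviation:

import numpy as np

# Toy data: rows are samples, columns are features (made-up values).
X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])

# z-score: subtract each column's mean and divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

print(Z.mean(axis=0))   # approximately 0 for each feature
print(Z.std(axis=0))    # 1 for each feature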


Figure 3.1: An example of a covariance matrix

• The covariance matrix : The covariance matrix is a square matrix that contains the variance of each variable and the covariance between each pair of variables. It measures how the variables are associated with one another; in other words, it provides an empirical description of the data and shows how the features are correlated.

As you noticed, it is a collection of variances and covariances. In reality, it is a collection of covariances between pairs of features, and the covariance of a feature with itself is simply its variance; for instance, covar(f1, f1) = var(f1). A short sketch of this step is given right below.
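As an illustrative sketch of this step (the data values are made up), the covariance matrix can be computed with NumPy as follows:

import numpy as np

# Toy data matrix: 4 samples, 3 features (made-up values).
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 2.0],
              [2.2, 2.9, 0.5],
              [1.9, 2.2, 1.1]])

X_centered = X - X.mean(axis=0)

# rowvar=False tells NumPy that the columns (not the rows) are the variables.
covariance_matrix = np.cov(X_centered, rowvar=False)

print(covariance_matrix)          # 3x3 symmetric matrix
print(covariance_matrix[0, 0])    # covar(f1, f1) == var(f1)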

• Calculating the eigenvalues and eigenvectors : Now we calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors give the directions of the spread of the data, i.e., the directions of highest variance; they are also called right eigenvectors because they are column vectors. The eigenvalues give the relative importance of these directions. Finding the eigenvectors and eigenvalues of the covariance matrix is equivalent to fitting the straight principal-component lines to the variance of the data. The eigenvectors v and eigenvalues λ are defined by the equation below, where A can be any square matrix (a short sketch of this step follows the equation).

(A − λI)v = 0
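As a hedged sketch of this step (the matrix values are made up for illustration), NumPy's eigh routine is a natural fit here because a covariance matrix is symmetric:

import numpy as np

# A toy 2x2 covariance matrix (made-up values).
covariance_matrix = np.array([[0.69, 0.62],
                              [0.62, 0.60]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

print(eigenvalues)     # relative importance of each direction
print(eigenvectors)    # each column is one direction of spread

# Sanity check of (A - lambda*I)v = 0 for the largest eigenvalue:
lam = eigenvalues[-1]
v = eigenvectors[:, -1]
print(np.allclose((covariance_matrix - lam * np.eye(2)) @ v, 0))   # True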

Figure 3.2: An example of eigenvectors and eigenvalues of the dataset with two variables A and B

• Calculating the new features, or principal components : Once we have computed the eigenvectors and eigenvalues, we can form the feature vector as illustrated in Figure 3.2. The eigenvectors are our principal components, and the eigenvalues give the relative importance of those components. Each eigenvector is perpendicular (orthogonal) to the ones calculated before it, which is why each of the principal components is uncorrelated with, i.e., independent of, the others.

The principal components (eigenvectors) are sorted by descending eigenvalue. The principal component with the highest eigenvalue is the first principal component, as it accounts for the largest variance, or spread, of the data.

In this example, we have 4 features, so there are four principal components. After sorting, we select 2 out of the 4 because they have the highest eigenvalues and therefore account for most of the spread of the data, so the feature vector is reduced to two columns, as illustrated in Figure 3.2. This means that two principal components can describe most of the data.

• Project the data onto the selected principal components for dimensionality reduction : Having selected, for example, 2 principal components out of 4, we can reduce the dimensionality of the data by applying the following formula (a short sketch of the sorting, selection, and projection steps is given below):

Final Data Set = Standardized Original Data Set × Feature Vector
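As an illustrative, hedged sketch of the sorting, selection, and projection steps (the data is synthetic, and keeping 2 components mirrors the example above):

import numpy as np

# Synthetic standardized data: 6 samples, 4 features (made up for illustration).
rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))
Z = Z - Z.mean(axis=0)

covariance_matrix = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# Sort the directions by descending eigenvalue (importance).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the top 2 components: this 4x2 matrix is the feature vector.
feature_vector = eigenvectors[:, :2]

# Final data set = standardized data * feature vector.
final_data = Z @ feature_vector
print(final_data.shape)                      # (6, 2)
print(eigenvalues[:2] / eigenvalues.sum())   # share of variance retained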

Figure 3.3: Visual representation of principal components one and two that fit on a dataset

• Remove less important features from the new dataset : Once the new feature set has been obtained, we decide what to keep and what to remove: only the relevant or important features are kept in the new dataset, and the unimportant features are removed.

Chapter 4

Applications of Principal Component Analysis

4.1 Applications of PCA


• PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.

• In machine learning, PCA is used to visualize multidimensional data.

• In healthcare data, it is used to explore the factors that are assumed to be very important in increasing the risk of a chronic disease.

• PCA helps to resize (compress) an image.

• PCA is used to analyze stock data and for forecasting.

• You can also use Principal Component Analysis to analyze patterns when you are dealing with high-dimensional datasets.

4.2 Advantages of Principal Component Analysis


• Easy to calculate and compute.


• Speeds up machine learning computing processes and algorithms.

• Helps prevent predictive algorithms from overfitting the data.

• Increases the performance of ML algorithms by eliminating unnecessary correlated variables.

• Principal Component Analysis retains the directions of highest variance and makes visualization easier.

• Helps reduce noise that cannot otherwise be removed automatically.

4.3 Disadvantages of Principal Component Analysis

• Sometimes, PCA is difficult to interpret. In rare cases, it may be difficult to identify the most important features even after computing the principal components.

• You may face some difficulties in calculating the covariances and covariance ma-
trices.

• Sometimes, the computed principal components can be more difficult to interpret than the original set of features.

Chapter 5

Implementation of PCA

import numpy as np

def pca(X, num_components):
    # First we need to do mean centering:
    mean_X = np.mean(X, axis=0)
    X_centered = X - mean_X

    # Then we need to calculate the covariance matrix:
    covariance_matrix = np.cov(X_centered, rowvar=False)

    # Then we calculate the eigenvalues and eigenvectors:
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

    # Sort the eigenvectors based on the eigenvalues (descending order):
    sorted_indices = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, sorted_indices]

    # Select the top 'num_components' eigenvectors:
    top_eigenvectors = eigenvectors[:, :num_components]

    # Finally we project the data onto the principal components:
    pca_result = np.dot(X_centered, top_eigenvectors)

    return pca_result

# Here is an example of usage:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Specifying the number of principal components to retain
num_components = 2

# Perform PCA
result = pca(data, num_components)

print("Original data:")
print(data)
print("\nPCA result:")
print(result)
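As an optional sanity check (not part of the original assignment code, and assuming scikit-learn is available), the result can be compared against scikit-learn's PCA. The columns should agree up to sign, since eigenvectors are only defined up to a sign flip:

from sklearn.decomposition import PCA
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Should match the output of pca(data, 2) above up to the sign of each column.
sk_result = PCA(n_components=2).fit_transform(data)
print(sk_result)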

Conclusion

In conclusion, this assignment has covered the essentials of Principal Component Analysis (PCA) and featured a practical implementation from scratch. We started by exploring the theoretical foundations, emphasizing the significance of eigenvectors, eigenvalues, and covariance matrices in PCA.

The step-by-step implementation clarified the algorithmic processes involved in dimensionality reduction. By computing principal components and showcasing their application, we gained a deeper understanding of PCA's role in extracting meaningful features from datasets.

Through experiments and visualizations, we observed PCA's efficacy in retaining the essential variance of the data while reducing dimensionality. The results underscored its usefulness in various domains where interpreting and visualizing complex datasets are paramount.

In summary, this assignment aimed to provide a clear and practical insight into PCA, fostering a foundational understanding of its principles and applications in data analysis.


