
PRINCIPAL COMPONENT ANALYSIS (PCA)
OUTLINE

What is PCA?

Dimensionality Reduction

Why PCA?

Important Terminologies

How Does PCA Work?

Applications of PCA

Advantages and Disadvantages


INTRODUCTION
Principal Component Analysis, commonly referred to as
PCA, is a powerful mathematical technique used in data
analysis and statistics. At its core, PCA is designed to
simplify complex datasets by transforming them into a more
manageable form while retaining the most critical
information.

• Reducing the dimensionality of a dataset

• Increasing interpretability while minimizing information loss


Dimensionality Reduction

Dimensionality reduction refers to the techniques that reduce the number of input variables in a dataset.

Why DR?
• Fewer dimensions for a given dataset means less computation or training time.
• Redundancy is removed by dropping similar entries from the dataset.
• Data compression (reduced storage space).
• It helps to find the most significant features and skip the rest.
• Leads to better human interpretation.
WHY PCA?

• Dimensionality Reduction
• Noise Reduction
• Visualization
• Feature Engineering
• Overfitting Problem
• Data Compression
• Machine Learning Processing
IMPORTANT TERMINOLOGIES

• Variance
• Covariance
• Eigenvalues
• Eigenvectors
• Principal Component
IMPORTANT TERMINOLOGIES (VARIANCE)

• Variance is the average of the squared differences between each value and the mean.
• Variance (σ²) = (Sum of the squared differences from the mean) / (Total number of values)
• In mathematical notation: σ² = Σ(x - μ)² / n
Here:
• μ is the mean of the feature
• Mean (μ) = (Sum of all values) / (Total number of values)
• The variance is a measure of how much the data scatter around the mean.
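As a quick check, here is a minimal Python sketch (using NumPy, an assumed dependency) that computes the mean and population variance exactly as the formulas above define them:

```python
import numpy as np

x = np.array([2, 3, 5, 7, 10], dtype=float)  # the X variable used in the worked example below

mu = x.sum() / len(x)                  # mean: sum of values / number of values
var = ((x - mu) ** 2).sum() / len(x)   # population variance: σ² = Σ(x - μ)² / n

print(mu, var)    # 5.4 8.24
print(np.var(x))  # NumPy's built-in population variance agrees: 8.24
```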


COMPUTE EIGENVALUES/EIGENVECTORS

Let A be a square N×N matrix and x a non-zero vector for which:

Ax = λx

for some scalar value λ. Then:

λ = an eigenvalue of matrix A
x = an eigenvector of matrix A

Eigenvalues are found by solving the characteristic equation:
det(A - λI) = 0 [returns n eigenvalues]
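A minimal NumPy sketch (assumed code, not from the slides) that solves Ax = λx for a small symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is the right routine for symmetric matrices; it returns eigenvalues
# in ascending order and the matching unit-length eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

print(eigenvalues)  # [1. 3.]

# Verify the defining property A x = λ x for each eigenpair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```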
HOW DOES PCA WORK?

• Step 1: Standardize the data.
• Step 2: Calculate the covariance matrix.
• Step 3: Compute the eigenvectors and eigenvalues.
• Step 4: Select the principal components.
• Step 5: Project data onto the new basis.
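Before working through these steps by hand, here is a minimal end-to-end sketch of the same pipeline using scikit-learn (an assumed dependency; the slides themselves do not name a library):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy dataset: 5 samples, 2 features (the same X and Y used in the worked example below).
data = np.array([[2, 4], [3, 5], [5, 7], [7, 8], [10, 11]], dtype=float)

# Step 1: standardize each feature to mean 0 and unit variance.
# (StandardScaler divides by the population standard deviation, i.e. n,
# while the worked example below divides by n - 1; the principal
# directions are unaffected by this choice.)
z = StandardScaler().fit_transform(data)

# Steps 2-5: PCA computes the covariance structure, its eigenvectors and
# eigenvalues, ranks the components, and projects the data onto them.
pca = PCA(n_components=2)
scores = pca.fit_transform(z)

print(pca.explained_variance_ratio_)  # share of total variance per component
print(scores)                         # the data expressed in the new basis
```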
Step-By-Step Explanation of PCA (Principal Component Analysis)

Step 1: Standardization
The main aim of this step is to standardize the range of the attributes so that each of them lies within
similar boundaries:

• Z = (x - μ) / σ

• μ is the mean of the feature

• σ is the standard deviation of the feature

• σ = √[ Σ(x - μ)² / N ]
STANDARDIZATION
Dataset:
Consider a small dataset with two variables, X and Y, represented by the following data points:
• X: [2, 3, 5, 7, 10]
• Y: [4, 5, 7, 8, 11]

For variable X:
• Mean (μX) = (2 + 3 + 5 + 7 + 10) / 5 = 5.4
• Standard Deviation (σX) = √[Σ(Xi - μX)² / (n - 1)] = √[(11.56 + 5.76 + 0.16 + 2.56 + 21.16) / 4] = √10.3 ≈ 3.21

For variable Y:
• Mean (μY) = (4 + 5 + 7 + 8 + 11) / 5 = 7
• Standard Deviation (σY) = √[Σ(Yi - μY)² / (n - 1)] = √[(9 + 4 + 0 + 1 + 16) / 4] = √7.5 ≈ 2.74

(Note that the sample convention with n - 1 is used here, whereas the formula above divides by N.)

• Standardized X: [-1.06, -0.75, -0.12, 0.50, 1.43]
• Standardized Y: [-1.10, -0.73, 0.00, 0.37, 1.46]
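The same numbers can be reproduced with a short NumPy sketch (assumed code, not from the slides):

```python
import numpy as np

X = np.array([2, 3, 5, 7, 10], dtype=float)
Y = np.array([4, 5, 7, 8, 11], dtype=float)

def standardize(v):
    # z = (x - μ) / σ, using the sample standard deviation (n - 1 in the denominator)
    return (v - v.mean()) / v.std(ddof=1)

print(np.round(standardize(X), 2))  # [-1.06 -0.75 -0.12  0.5   1.43]
print(np.round(standardize(Y), 2))  # [-1.1  -0.73  0.    0.37  1.46]
```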
Covariance Matrix Computation
The covariance matrix expresses the correlation between any two or more attributes in
a multidimensional dataset.

• Variance is denoted by Var

• Covariance is denoted by Cov
COVARIANCE MATRIX COMPUTATION

For two variables, the covariance matrix has the form:

Cov(X, X) Cov(X, Y)
Cov(Y, X) Cov(Y, Y)

• Using the formula for covariance on the standardized data:

Cov(X, X) = Σ(Standardized X * Standardized X) / (n - 1) = (1.12 + 0.56 + 0.02 + 0.25 + 2.05) / 4 = 1.00
Cov(X, Y) = Σ(Standardized X * Standardized Y) / (n - 1) = (1.16 + 0.55 + 0.00 + 0.18 + 2.09) / 4 ≈ 0.995
Cov(Y, X) = Cov(X, Y) ≈ 0.995
Cov(Y, Y) = Σ(Standardized Y * Standardized Y) / (n - 1) = (1.20 + 0.53 + 0.00 + 0.13 + 2.13) / 4 ≈ 1.00

• Covariance Matrix:

1.000 0.995
0.995 1.000

(The diagonal entries equal 1 because each standardized variable has unit sample variance; the off-diagonal entry is the correlation between X and Y.)
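A quick NumPy check of this matrix (assumed code; np.cov divides by n - 1 by default, matching the convention above):

```python
import numpy as np

zX = np.array([-1.06, -0.75, -0.12, 0.50, 1.43])  # standardized X from above
zY = np.array([-1.10, -0.73, 0.00, 0.37, 1.46])   # standardized Y from above

# np.cov treats each argument as one variable's observations.
C = np.cov(zX, zY)
print(np.round(C, 3))
# [[0.999 0.997]
#  [0.997 1.003]]  -- small deviations from 1.000/0.995 come from the
#                     z-scores having been rounded to two decimals
```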
Important Terminologies (Covariance)

• Covariance describes the relationship between a pair of random variables, where a change in one variable is accompanied by a change in the other.
• It can take any value from -infinity to +infinity; a negative value represents a negative relationship, whereas a positive value represents a positive relationship.
• It is used for the linear relationship between variables.
• It gives the direction of the relationship between variables.
IMPORTANT TERMINOLOGIES
(COVARIANCE)

The formula for the covariance (Cov) between two random variables X and Y,
each with N data points, is as follows:
Cov(X, Y) = (1/N) * Σ (from i=1 to N) [(Xi - X̄) * (Yi - Ȳ)]
Where:
• Cov(X, Y) is the covariance between X and Y.
• N is the number of data points.
• Xi and Yi represent individual data points, and X̄ and Ȳ are the means of X and Y.
(The sample version divides by N - 1 instead of N, as in the worked example above.)
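A direct translation of this formula into Python (assumed code; population version with 1/N):

```python
import numpy as np

def covariance(x, y):
    # Cov(X, Y) = (1/N) * Σ (x_i - mean(x)) * (y_i - mean(y))
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).sum() / len(x)

X = [2, 3, 5, 7, 10]
Y = [4, 5, 7, 8, 11]
print(covariance(X, Y))  # 7.0 -- the raw (unstandardized) covariance of X and Y
```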
COMPUTE EIGENVALUES AND EIGENVECTORS OF THE
COVARIANCE MATRIX TO IDENTIFY PRINCIPAL COMPONENTS

Solving det(C - λI) = 0 for the covariance matrix above yields two eigenvalues and their corresponding eigenvectors:

• Eigenvalue 1 (λ1) ≈ 1.995
• Eigenvector 1 (v1) = [0.707, 0.707]
• Eigenvalue 2 (λ2) ≈ 0.005
• Eigenvector 2 (v2) = [-0.707, 0.707]

Nearly all of the variance lies along the first component, which is expected because X and Y are almost perfectly correlated.
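This decomposition can be checked with NumPy (assumed code):

```python
import numpy as np

C = np.array([[1.000, 0.995],
              [0.995, 1.000]])

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(C)

print(eigenvalues)   # [0.005 1.995]
print(eigenvectors)  # columns are unit eigenvectors: ±[0.707, -0.707] and ±[0.707, 0.707]
```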
SELECT THE PRINCIPAL COMPONENTS

1. The first principal component is the direction of greatest variability (variance) in the data.
2. The second is the next orthogonal (uncorrelated) direction of greatest variability.
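Components are ranked by eigenvalue; the explained variance ratio shows how much variance each one retains (a small assumed sketch):

```python
import numpy as np

eigenvalues = np.array([1.995, 0.005])
ratio = eigenvalues / eigenvalues.sum()
print(ratio)  # [0.9975 0.0025] -> keeping only PC1 preserves ~99.75% of the variance
```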
PROJECT DATA ONTO THE PRINCIPAL COMPONENTS

• To transform the data into the new principal component space, we take the dot product of each standardized data point with the eigenvectors:
• PC1 score = 0.707 · (Standardized X) + 0.707 · (Standardized Y)
• PC2 score = -0.707 · (Standardized X) + 0.707 · (Standardized Y)
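Putting the worked example together, a minimal NumPy sketch (assumed code) that projects the standardized data onto both components:

```python
import numpy as np

Z = np.column_stack([
    [-1.06, -0.75, -0.12, 0.50, 1.43],  # standardized X
    [-1.10, -0.73, 0.00, 0.37, 1.46],   # standardized Y
])

V = np.array([[0.707, -0.707],          # columns: v1 and v2
              [0.707,  0.707]])

scores = Z @ V                          # each row: (PC1 score, PC2 score)
print(np.round(scores, 2))
# PC1 scores are large while PC2 scores stay near zero: the data are
# essentially one-dimensional along the first principal component.
```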
APPLICATIONS OF PCA

• Netflix Movie Recommendations
• Fitness Trackers
• Car Shopping
• Real Estate
• Grocery Shopping
• Manufacturing and Quality Control
• Sports Analytics
• Smart Cities
• Renewable Energy
Advantages of PCA

• Prevents Overfitting
• Speeds Up Other Machine Learning Algorithms
• Improves Visualization
• Dimensionality Reduction
• Noise Reduction
LIMITATIONS OF PCA

• Linearity Assumption
• Loss of Interpretability
• Loss of Information
• Sensitivity to Scaling
• Orthogonal Components
