Principal Component Analysis

Principal component analysis (PCA) involves four main steps: standardization of the data, computation of the covariance matrix, determination of eigenvectors and eigenvalues to identify the principal components, and recasting of the data along the principal component axes. An example uses a dataset of apple samples characterized by four features to demonstrate each step. The first two principal components are selected to represent the data, compressing it into a smaller dataset with minimal loss of information.


Steps for principal component analysis
STEP 1: STANDARDIZATION

• In this step, the variables are rescaled so that each one contributes equally to the analysis.
• Without standardization, variables with large ranges dominate variables with small ranges.
• That dominance would bias the results at the end of the analysis, which standardization prevents.
• To transform all variables to the same standard, apply the following formula to each value:

z = (value - mean) / standard deviation
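As an illustration (not part of the original slides), a minimal NumPy sketch of this step might look like the following; the ddof=1 choice matches the sample standard deviation used in the worked example later in this deck:

```python
import numpy as np

def standardize(X):
    """Rescale each column (variable) of X to zero mean and unit variance.

    Uses the sample standard deviation (ddof=1), matching the
    worked apple example later in this deck.
    """
    mean = X.mean(axis=0)        # per-variable mean
    std = X.std(axis=0, ddof=1)  # per-variable sample standard deviation
    return (X - mean) / std
```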
STEP 2: COVARIANCE MATRIX COMPUTATION

• In this step, you learn how the variables of the data vary around their means, and which variables are correlated with one another.
• Highly interrelated variables can be identified at the end of this step, since they carry redundant information.
• To find the highly interrelated variables, calculate the covariance matrix; each entry is given by the following formula (dividing by n, as in the worked example later):

COV(X, Y) = Σ (xᵢ - mean(X)) (yᵢ - mean(Y)) / n
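A small illustrative sketch of this computation, assuming the data has already been standardized (so each column has zero mean):

```python
import numpy as np

def covariance_matrix(Z):
    """Covariance matrix of standardized (zero-mean) data Z,
    with samples in rows and variables in columns.

    Divides by n, as in the worked example below;
    np.cov(Z, rowvar=False, bias=True) computes the same thing.
    """
    n = Z.shape[0]
    return (Z.T @ Z) / n
```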
STEP 3: FEATURE VECTOR

• To determine the principal components, you compute the eigenvalues and eigenvectors of the covariance matrix. Let A be any square matrix. A non-zero vector v is an eigenvector of A if
• Av = λv
• for some number λ, called the corresponding eigenvalue.
• Once you have computed the eigenvectors, sort the eigenvalues in descending order; the eigenvectors, taken in that order, give you the list of principal components.
• The eigenvectors give the directions of the principal components, and each eigenvalue gives the amount of variance along its direction.
• A direction with a large variance has many data points spread along its line; thus, there is more information on that line.
• Finally, these principal components form a new set of axes, making the data easier to evaluate and the differences between observations easier to monitor.
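An illustrative sketch of this step: np.linalg.eigh is appropriate because a covariance matrix is symmetric, and sorting puts the direction of greatest variance first.

```python
import numpy as np

def principal_components(cov):
    """Return the eigenvalues and eigenvectors of a covariance matrix,
    sorted by descending eigenvalue (largest variance first)."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: symmetric input
    order = np.argsort(eigenvalues)[::-1]            # descending order
    return eigenvalues[order], eigenvectors[:, order]
```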
STEP 4: RECAST THE DATA ALONG THE PRINCIPAL COMPONENT AXES
• Until now, apart from standardization, you haven't made any changes to the original data. You have only selected the principal components and formed a feature vector; the initial data still sits on its original axes.
• This step reorients the data from the original axes to the axes you have calculated from the principal components.
• This is done with the following formula:

Final Data Set= Standardized Original Data Set * Feature Vector
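In code, this recasting is a single matrix multiplication. An illustrative sketch, where k is the number of components kept:

```python
import numpy as np

def recast(Z, eigenvectors, k=2):
    """Final Data Set = Standardized Original Data Set * Feature Vector.

    The feature vector stacks the top-k eigenvectors as columns."""
    feature_vector = eigenvectors[:, :k]
    return Z @ feature_vector
```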


For example:

Consider a real-world situation: you have to recognize patterns of good-quality apples in the food-processing industry. When you have to detect and recognize thousands of samples, you need an algorithm to sort them out. As a first step, all possible features are encoded as vector components, and all the samples are passed through the algorithm (much like a sensor scanning each sample) for analysis.

After analyzing the algorithm's reports in bulk, you can separate the apple samples with greater variances (very small or very large in size, rotten samples, damaged samples, etc.) from the apple samples with smaller variances (samples with leaves or branches, samples outside the vector-component value ranges, etc.).

Let us take the features (dimensions) to be:
F1 = Large-size apples
F2 = Rotten apples
F3 = Damaged apples
F4 = Small apples

Step 1: Find the standardized set


Calculate the mean and standard deviation of each feature.

Data set (F1 = Large-size apples, F2 = Rotten apples, F3 = Damaged apples, F4 = Small apples):

F1  F2  F3  F4
 1   5   3   1
 4   2   6   3
 1   4   3   2
 4   4   1   1
 5   5   2   3

                      F1     F2     F3     F4
Mean                  3      4      3      2
Standard deviation    1.87   1.223  1.87   1
Then, applying the standardization formula above to each variable gives the results below:

F1        F2        F3        F4
-1.0695    0.8196    0        -1
 0.5347   -1.6393    1.6042    1
-1.0695    0         0         0
 0.5347    0        -1.0695   -1
 1.0695    0.8196   -0.5347    1

This is the standardized data set
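To check this step concretely, here is a small runnable snippet (illustrative, not from the slides) that standardizes the apple table; the output matches the standardized table above up to the rounding used on the slides:

```python
import numpy as np

# Apple samples from the table above: columns F1, F2, F3, F4.
X = np.array([
    [1, 5, 3, 1],
    [4, 2, 6, 3],
    [1, 4, 3, 2],
    [4, 4, 1, 1],
    [5, 5, 2, 3],
], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z.round(4))  # first column: -1.069, 0.5345, -1.069, 0.5345, 1.069
```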


Step 2: Find the covariance matrix, using the covariance formula above.
Since the features were standardized in the previous step, we can take Mean = 0 and Standard deviation = 1 for all of them.

      F1           F2           F3           F4
F1    VAR(F1)      COV(F1,F2)   COV(F1,F3)   COV(F1,F4)
F2    COV(F2,F1)   VAR(F2)      COV(F2,F3)   COV(F2,F4)
F3    COV(F3,F1)   COV(F3,F2)   VAR(F3)      COV(F3,F4)
F4    COV(F4,F1)   COV(F4,F2)   COV(F4,F3)   VAR(F4)

VAR(F1) = [(-1.0695 - 0)² + (0.5347 - 0)² + (-1.0695 - 0)² + (0.5347 - 0)² + (1.0695 - 0)²] / 5

Therefore VAR(F1) = 0.78.


COV(F1,F2) = [(-1.0695 - 0)(0.8196 - 0) + (0.5347 - 0)(-1.6393 - 0) + (-1.0695 - 0)(0 - 0) + (0.5347 - 0)(0 - 0) + (1.0695 - 0)(0.8196 - 0)] / 5

COV(F1,F2) = -0.8586

Similarly solving for all features, the covariance matrix will be:

      F1        F2        F3        F4
F1     0.78    -0.8586   -0.055     0.424
F2    -0.8586   0.78     -0.607    -0.326
F3    -0.055   -0.607     0.78      0.426
F4     0.424   -0.326     0.426     0.78
Step 3: Find the eigenvalues and eigenvectors, and pick the topmost eigenvalues; their eigenvectors will be our principal components.
• Let v be a non-zero vector and λ a scalar.
• If Av = λv, then λ is called the eigenvalue associated with the eigenvector v of A.
• To solve for λ, set det(A - λI) = 0, which gives the following matrix:
      F1        F2        F3        F4
F1    0.78-λ   -0.8586   -0.055     0.424
F2    -0.8586   0.78-λ   -0.607    -0.326
F3    -0.055   -0.607     0.78-λ    0.426
F4     0.424   -0.326     0.426     0.78-λ


When solving for λ we get
λ= 2.11691, 0.855413, 0.481689, 0.334007
Solving for the eigenvector of each eigenvalue:

E1 of λ1     E2 of λ2     E3 of λ3     E4 of λ4
 0.515514    -0.623012    0.0349815   -0.587262
-0.616625     0.113105    0.452326    -0.634336
 0.399314     0.744256   -0.280906    -0.455767
 0.441098     0.212477    0.845736     0.212173

Arrange the eigenvalues in descending order and select the eigenvectors of the largest ones as the principal components. Here λ1 and λ2 are the two largest eigenvalues, so E1 of λ1 and E2 of λ2 are selected.
Step 4: Recast the data along the principal component axes.

• Solving with the following equation gives the final data set:

Final Data Set= Standardized Original Data Set * Feature Vector

PC1 (E1 of λ1)     PC2 (E2 of λ2)
 0.4268066978       0.00116114
-0.4234920968      -0.39182856
-0.551342223       -0.7959219
 0.4652040128       0.899996329
 0.082727948        0.28653037

As seen from the table, the original data set has been compressed into a smaller one with minimal loss of information. This is the significance of the principal components.
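For completeness, the whole four-step pipeline can be run on the apple data in a few lines. This is an illustrative sketch rather than the slides' own computation: eigenvector signs are arbitrary, and machine results can differ from the hand-rounded tables above.

```python
import numpy as np

# Apple data (F1..F4) from the worked example.
X = np.array([
    [1, 5, 3, 1],
    [4, 2, 6, 3],
    [1, 4, 3, 2],
    [4, 4, 1, 1],
    [5, 5, 2, 3],
], dtype=float)

# Step 1: standardize with the sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix, dividing by n as on the slides.
C = np.cov(Z, rowvar=False, bias=True)

# Step 3: eigenpairs sorted by descending eigenvalue.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 4: recast the data onto the top two principal axes.
final = Z @ vecs[:, :2]
print(final.round(4))
```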
Application of PCA

• Using principal component analysis for damage detection of aerospace structures.
• In machine learning, PCA is used to visualize multidimensional data.
• In healthcare data, to explore factors that are assumed to be important in increasing the risk of a chronic disease.
• PCA helps to resize an image.
• PCA is used to analyze stock data and for forecasting.
• You can also use principal component analysis to analyze patterns when dealing with high-dimensional data sets.
