Principal Component Analysis
STEP 1: STANDARDIZATION
• In this step, you standardize the variables so that each of them contributes on the same scale: from every value you subtract the variable's mean and divide by its standard deviation.
STEP 2: COVARIANCE MATRIX
• The covariance matrix shows how the variables vary together, so any interrelated variables can be sorted out at the end of this step.
• To segregate the highly interrelated variables, you calculate the covariance matrix using cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1); its diagonal entries are the variances of the variables and its off-diagonal entries are the covariances between pairs of variables.
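The two steps above can be sketched in Python; the data matrix here is invented purely for illustration:

```python
import numpy as np

# Hypothetical data: 5 samples x 3 variables (values invented for illustration)
X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.2],
    [2.2, 2.9, 0.9],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 1.1],
])

# Step 1: standardization -- z = (x - mean) / std puts every variable
# on the same scale, so no variable dominates just because of its units
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized variables; the diagonal
# holds the variances (all 1 after standardization) and the off-diagonal
# entries show how strongly pairs of variables are interrelated
C = np.cov(Z, rowvar=False)
print(np.round(C, 3))
```

Because the variables are standardized, this covariance matrix is also the correlation matrix of the original data.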
STEP 3: FEATURE VECTOR
• To determine the principal components, you have to find the eigenvalues and eigenvectors of the covariance matrix. Let A be a square matrix. A non-zero vector v is an eigenvector of A if
• Av = λv
• for some number λ, called the corresponding eigenvalue.
• Once you have computed the eigenvectors, sort the eigenvalues in descending order; the eigenvectors, ranked by their eigenvalues, give the list of principal components.
• The eigenvectors define the directions of the principal components, and each eigenvalue measures the variance of the data along its eigenvector's direction.
• A component with a large eigenvalue captures a large variance: the data points spread widely along that line, so the line carries more information.
• Finally, the chosen principal components form a new set of axes, which makes the data easier to evaluate and the differences between observations easier to monitor.
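As a sketch, the eigen-decomposition of a covariance matrix (here the 4×4 matrix from the worked example in this section) can be computed with NumPy:

```python
import numpy as np

# Covariance matrix taken from the worked example in this section
C = np.array([
    [ 0.78,   -0.8586, -0.055,  0.424],
    [-0.8586,  0.78,   -0.607, -0.326],
    [-0.055,  -0.607,   0.78,   0.426],
    [ 0.424,  -0.326,   0.426,  0.78 ],
])

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so reverse the order to sort them descending
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i pairs with eigenvalues[i]

# Each column v satisfies the defining relation A v = lambda v
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(C @ v, lam * v))  # True
```

Note that eigenvectors are only defined up to sign, so a numerical solver may return any eigenvector multiplied by −1.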
STEP 4: RECAST THE DATA ALONG THE
PRINCIPAL COMPONENTS AXES
• Until now, apart from standardization, you haven't made any changes to the original data. You have only selected the principal components and formed a feature vector; the initial data still sit on their original axes.
• This step reorients the data from their original axes to the axes calculated from the principal components.
• This is done by multiplying the standardized data set by the feature vector (the matrix whose columns are the selected eigenvectors): FinalDataSet = StandardizedData × FeatureVector.
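A minimal sketch of this reorientation, assuming a standardized data matrix Z and a feature vector W built from the top eigenvectors (both invented here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized data: 6 samples x 4 features
Z = rng.standard_normal((6, 4))

# Hypothetical feature vector: top-2 eigenvectors as columns.
# Orthonormal columns are produced here via QR purely for illustration;
# in real PCA they come from the covariance matrix's eigen-decomposition.
W, _ = np.linalg.qr(rng.standard_normal((4, 2)))

# Recast the data: each row (sample) of Z is projected onto the new axes
final = Z @ W
print(final.shape)  # (6, 2)
```

Each sample keeps one coordinate per retained component, so the dimensionality drops from 4 to 2.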
1. Consider a real-time example: a situation where you have to recognize patterns of good-quality apples in the food processing industry. When you have to detect and recognize thousands of samples, you need an algorithm to sort this out. As a first step, all possible features are categorized as vector components and all the samples are passed through an algorithm (much like a sensor that scans the samples) for analysis.
After analyzing the algorithm's bulk reports, you may categorize the apple samples that have larger variances (very small or very large in size, rotten samples, damaged samples, etc.) and, at the same time, the apple samples that have smaller variances (samples with leaves or branches, samples that do not fall under the vector component values, etc.).
Let us take the feature dimensions to be
F1= Large size apples
F2= Rotten apples
F3= Damaged apples
F4= Small apples
:
      F1        F2        F3        F4
   -1.0695    0.8196    0        -1
    0.5347   -1.6393    1.6042    1
   -1.0695    0         0         0
    0.5347    0        -1.0695   -1
    1.0695    0.8196   -0.5347    1
      F1           F2           F3           F4
F1    VAR(F1)      COV(F1,F2)   COV(F1,F3)   COV(F1,F4)
F2    COV(F2,F1)   VAR(F2)      COV(F2,F3)   COV(F2,F4)
F3    COV(F3,F1)   COV(F3,F2)   VAR(F3)      COV(F3,F4)
F4    COV(F4,F1)   COV(F4,F2)   COV(F4,F3)   VAR(F4)
Similarly solving for all features, the covariance matrix will be:
      F1        F2        F3        F4
F1     0.78    -0.8586   -0.055     0.424
F2    -0.8586   0.78     -0.607    -0.326
F3    -0.055   -0.607     0.78      0.426
F4     0.424   -0.326     0.426     0.78
Step 3: Find the eigenvalues and eigenvectors, and finally pick the
topmost eigenvalues to be our principal components.
• Let v be a non-zero vector and λ a scalar.
• If Av = λv, then λ is called the eigenvalue associated with the eigenvector v
of A.
• Solving det(A − λI) = 0 for λ and then finding the eigenvectors gives the following matrix:
       E1 (λ1)     E2 (λ2)     E3 (λ3)     E4 (λ4)
F1     0.515514   -0.623012    0.0349815  -0.587262
F2    -0.616625    0.113105    0.452326   -0.634336
F3     0.399314    0.744256   -0.280906   -0.455767
F4     0.441098    0.212477    0.845736    0.212173
Sum    0.739676    0.446826    1.0521375  -1.465192
Looking at the table, arrange the eigenvalues in descending order and pick the
topmost ones: their eigenvectors, E1 of λ1 and E2 of λ2, are selected as our
principal components.
Step 4: Recast the data along the axes of the principal components
• The final data set is obtained by multiplying the standardized data by the
feature vector formed from the selected eigenvectors, giving each sample's
coordinates on the new axes.
Seeing from the table, the large original data set has been compressed into a much
smaller one with very little loss of information. This is the significance of the principal components.
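Putting the worked example together, the standardized apple data from the table above can be recast onto E1 and E2 (all numbers are taken from this section's tables):

```python
import numpy as np

# Standardized data (columns F1..F4) from the worked example
Z = np.array([
    [-1.0695,  0.8196,  0.0,    -1.0],
    [ 0.5347, -1.6393,  1.6042,  1.0],
    [-1.0695,  0.0,     0.0,     0.0],
    [ 0.5347,  0.0,    -1.0695, -1.0],
    [ 1.0695,  0.8196, -0.5347,  1.0],
])

# Feature vector: the selected eigenvectors E1 and E2 as columns
W = np.array([
    [ 0.515514, -0.623012],
    [-0.616625,  0.113105],
    [ 0.399314,  0.744256],
    [ 0.441098,  0.212477],
])

# Recast: 5 samples in 4 dimensions become 5 samples in 2 dimensions
final = Z @ W
print(final.shape)  # (5, 2)
```

Each row of `final` is one apple sample expressed in the two principal-component coordinates instead of the original four features.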
Application of PCA