Principal Component Analysis
• Smaller data sets are easier to explore and visualize, and they make data analysis much easier and faster for machine learning algorithms, since there are no extraneous variables to process.
Goals of PCA:
1. Identify patterns in data
2. Detect correlations between variables
These principal components are orthogonal to each other, meaning they are
uncorrelated.
The first principal component captures the largest amount of variation in the
data, followed by the second principal component, and so on.
• PCA is an unsupervised pre-processing task that is carried out before
applying any ML algorithm.
• The attribute which describes the most variance is called the first
principal component and is placed at the first coordinate.
• Similarly, the attribute that stands second in describing variance is called the second principal component, and so on. In short, the complete dataset can be expressed in terms of principal components.
• The sum of the variances of the new features / principal components is equal to the sum of the variances of the original features, as the short check below illustrates.
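A minimal sketch of this property (assuming NumPy and scikit-learn are available; the toy data and variable names are illustrative, not the author's code):

```python
# Sketch: verify that the total variance is preserved by PCA
# (toy random data; assumes numpy and scikit-learn are installed)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # 100 examples, 4 features

pca = PCA()                                   # keep all components
pca.fit(X)

total_original = X.var(axis=0, ddof=1).sum()  # sum of original feature variances
total_pcs = pca.explained_variance_.sum()     # sum of principal component variances
print(np.isclose(total_original, total_pcs))  # True
```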
Working of PCA
Step 1: Standardize the dataset. This step ensures that all features contribute equally to the analysis and prevents variables with larger scales from dominating the principal components.
Step 2: Calculate the covariance matrix for the features in the standardized dataset. The covariance matrix indicates the relationships and dependencies between pairs of features.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
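A compact sketch of these three steps in NumPy (the data and variable names are illustrative):

```python
# Sketch of the three steps with NumPy (illustrative data)
import numpy as np

X = np.random.default_rng(1).normal(size=(5, 4))   # 5 examples, 4 features

# Step 1: standardize each feature to mean 0 and standard deviation 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)               # eigh: C is symmetric
print(eigvals)
```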
Assume we have the dataset below, which has 4 features and a total of 5 training examples.
First, we need to standardize the dataset, and for that we need to calculate the mean and standard deviation for each feature. Each value is then transformed as z = (x - mean) / standard deviation. After applying this formula, each feature in the dataset is transformed as below:
2. Calculate the covariance matrix for the whole dataset
The covariance between two features x and y is
cov(x, y) = Σ (x[i] - mean(x)) * (y[i] - mean(y)) / n
• x[i] and y[i] are the values of x and y for each data point, mean(x) and mean(y) are their means, and n is the number of data points.
cov(f1,f2) = ((-1.0 - 0)*(-0.632456 - 0) + (0.33 - 0)*(1.264911 - 0) + (-1.0 - 0)*(0.632456 - 0) + (0.33 - 0)*(0.000000 - 0) + (1.33 - 0)*(-1.264911 - 0)) / 5
cov(f1,f2) = -0.25298
(Since the features are standardized, their means are 0, which is why each term subtracts 0; dividing the sum -1.264911 by n = 5 gives -0.25298.)
In a similar way, we can calculate the other covariances, which results in the covariance matrix below.
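As a quick check of the hand calculation above (using the standardized f1 and f2 values from the text; bias=True makes NumPy divide by n):

```python
# Sketch: verify the hand-calculated covariance with NumPy
import numpy as np

f1 = np.array([-1.0, 0.33, -1.0, 0.33, 1.33])                   # standardized f1
f2 = np.array([-0.632456, 1.264911, 0.632456, 0.0, -1.264911])  # standardized f2

# bias=True divides by n (here 5), matching the calculation above
print(np.cov(f1, f2, bias=True)[0, 1])   # approximately -0.25298
```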
3. Calculate eigenvalues and eigenvectors.
Let A be a square matrix (in our case the covariance matrix), ν a vector, and λ a scalar satisfying Aν = λν; then λ is called an eigenvalue associated with the eigenvector ν of A.
Rearranging gives (A - λI)ν = 0. Since we already know that ν is a non-zero vector, the only way this equation can hold is if
det(A - λI) = 0
Solving this equation for our 4 × 4 covariance matrix yields 4 eigenvalues and their corresponding eigenvectors, which form a 4 × 4 matrix.
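In practice this is done numerically; a minimal sketch (the matrix values are illustrative, not the example's actual covariance matrix):

```python
# Sketch: eigen-decomposition of a covariance matrix (illustrative values)
import numpy as np

C = np.array([[1.00, -0.25],
              [-0.25, 1.00]])

eigvals, eigvecs = np.linalg.eigh(C)   # eigh is suited to symmetric matrices
print(eigvals)                         # eigenvalues (ascending order)
print(eigvecs)                         # one eigenvector per column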
4. Sort eigenvalues and their corresponding eigenvectors.
Since the eigenvalues are already sorted in this case, there is no need to sort them again.
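When they are not already ordered, a sketch of the usual descending sort (variable names are illustrative):

```python
# Sketch: sort eigenvalues (and matching eigenvectors) in descending order
import numpy as np

eigvals = np.array([0.75, 1.25])   # illustrative values
eigvecs = np.eye(2)                # columns are the matching eigenvectors

order = np.argsort(eigvals)[::-1]  # indices from largest to smallest
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]        # reorder the columns to match
```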
Python implementation
2. Loading Data
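A minimal sketch, assuming scikit-learn's built-in breast cancer dataset (which matches the 30 features and the benign/malignant classes described below):

```python
# Sketch: load and standardize the data (assumes scikit-learn's
# breast cancer dataset, matching the 30 features described below)
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target                  # X: (569, 30), y: 0 or 1

X_scaled = StandardScaler().fit_transform(X)   # standardize before PCA
```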
3. Apply PCA
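Continuing from the loading sketch above, applying PCA with three components might look like this:

```python
# Sketch: reduce the 30 features to 3 principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
x = pca.fit_transform(X_scaled)   # x has shape (569, 3)
print(x.shape)
```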
Thus, it is clear that with PCA, the number of dimensions has been reduced from 30 to 3. If we chose n_components=2, the dimensions would be reduced to 2.
4. Check Components
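A sketch of inspecting the fitted components (continuing from the code above):

```python
# Sketch: inspect the principal components
print(pca.components_.shape)   # (3, 30): one row per principal component
print(pca.components_)
```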
We can easily see that there are three rows, as n_components was chosen to be 3. However, each row has 30 columns, as in the actual data.
5. Plot the components (Visualization)
Though we took n_components = 3, here we plot a 2D graph using the first two principal components as well as a 3D graph using all three principal components.
The colors show the two output classes of the original dataset: benign and malignant.
It is clear that the principal components show a clear separation between the two output classes.
For three principal components, we need to plot a 3D graph. x[:,0] signifies the first principal component; similarly, x[:,1] and x[:,2] represent the second and third principal components.
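A sketch of both plots (continuing from the code above; assumes matplotlib is installed, and the colormap choice is illustrative):

```python
# Sketch: visualize the components, colored by output class
import matplotlib.pyplot as plt

# 2D plot: first two principal components
plt.figure()
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='coolwarm', s=10)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()

# 3D plot: all three principal components
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 0], x[:, 1], x[:, 2], c=y, cmap='coolwarm', s=10)
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
plt.show()
```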
• The data is passed through the fit and transform methods, which compute the projection and apply it to the attributes.
https://setosa.io/ev/principal-component-analysis/
Here, pca.components_ gives the principal axes in feature space, representing the directions of maximum variance in the data.