
prcomp in R

Last Updated : 30 May, 2024

The prcomp function is R's standard tool for performing Principal Component Analysis (PCA). This article discusses PCA using prcomp in R, covering the underlying concepts, the function's arguments, and a worked example of its usage.

Understanding PCA

PCA is a statistical technique applied to high-dimensional datasets to reduce their dimensionality while retaining as much of the important information as possible. It does this by transforming the original variables into a new set of uncorrelated variables, called principal components, which are ordered by how much of the variance they explain.

Understanding the Role of the Prcomp Function

The prcomp function in R computes the principal components, i.e. the directions that account for the most variation in a dataset. These principal components are new variables constructed from the original ones: they are linearly independent of each other and are ranked by the amount of variance they explain. By projecting the original data onto these components, prcomp transforms high-dimensional data into a more manageable form without losing the inherent structure of the data.
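To see what this means in practice, here is a minimal sketch on simulated data (the matrix x and object p below are hypothetical and not part of the article's example): the principal-component scores returned by prcomp in $x are mutually uncorrelated.

R
# Minimal sketch on simulated data: PC scores are uncorrelated
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)   # 50 observations, 4 features
p <- prcomp(x)
round(cor(p$x), 8)                  # off-diagonal correlations are (numerically) zero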

The basic syntax of the prcomp function is:

Syntax: prcomp(x, center = TRUE, scale. = FALSE, rank. = NULL)

Where:

  • x: A numeric matrix or data frame to be analyzed.
  • center: A logical value indicating whether the variables should be centered to have mean zero. Default is TRUE.
  • scale.: A logical value indicating whether the variables should be scaled to have unit variance. Default is FALSE (see the sketch after this list).
  • rank.: The number of principal components to retain. If NULL, all components are retained.
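As a rough illustration of how center and scale. behave, the following sketch (on hypothetical random data m, not the article's dataset) shows that calling prcomp with scale. = TRUE is equivalent to standardizing the columns yourself with scale() first.

R
# Hypothetical data: 10 observations of 4 features
set.seed(42)
m <- matrix(rnorm(40), ncol = 4)

p1 <- prcomp(m, center = TRUE, scale. = TRUE)          # let prcomp standardize
p2 <- prcomp(scale(m), center = FALSE, scale. = FALSE) # standardize manually first

# The loadings (rotation matrices) agree, up to possible sign flips per component
all.equal(abs(p1$rotation), abs(p2$rotation))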

Now consider a dataset containing measurements of various chemical compounds:

Compound   Feature 1   Feature 2   Feature 3   Feature 4
A          0.1         0.2         0.5         0.4
B          0.3         0.6         0.2         0.8
C          0.4         0.1         0.9         0.3
D          0.7         0.5         0.4         0.6

Implementation of PCA with prcomp

Let's implement PCA with the prcomp function in R using the following steps:

Step 1. Load the library

We load the stats package, which is part of the core R distribution and attached by default, so this call is optional. This package provides the prcomp function used for Principal Component Analysis (PCA).

R
library(stats)

Step 2. Define the dataset

Now we define a data frame called data, containing the chemical compound measurements. Each row corresponds to a specific compound, and each column to a feature.

R
data <- data.frame(
  Feature1 = c(0.1, 0.3, 0.4, 0.7),
  Feature2 = c(0.2, 0.6, 0.1, 0.5),
  Feature3 = c(0.5, 0.2, 0.9, 0.4),
  Feature4 = c(0.4, 0.8, 0.3, 0.6)
)
data

Output:

  Feature1 Feature2 Feature3 Feature4
1      0.1      0.2      0.5      0.4
2      0.3      0.6      0.2      0.8
3      0.4      0.1      0.9      0.3
4      0.7      0.5      0.4      0.6

Step 3. Perform PCA using prcomp

The prcomp function performs PCA on the dataset data. Setting scale. = TRUE scales each variable to unit variance before the decomposition, which is a common preprocessing step.

R
pca_result <- prcomp(data, scale. = TRUE)

summary(pca_result)

Output:

Importance of components:
                           PC1    PC2     PC3       PC4
Standard deviation       1.710 1.0069 0.24952 1.938e-16
Proportion of Variance   0.731 0.2535 0.01557 0.000e+00
Cumulative Proportion    0.731 0.9844 1.00000 1.000e+00

In practical terms, the first two principal components (PC1 and PC2) are sufficient to explain most of the variability in the dataset (98.44%), which means we could reduce the dimensionality of the data to these two components with minimal loss of information. This dimensionality reduction can simplify analysis and visualization without significantly compromising the integrity of the data.
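For instance, the scores of the observations on the first two components can be taken directly from pca_result$x and used as a reduced, two-column representation of the original four features:

R
# Keep only the scores on PC1 and PC2 as a lower-dimensional representation
reduced_data <- pca_result$x[, 1:2]
reduced_data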

Step 4. Visualize the PCA

Now we will visualize the PCA results computed by the prcomp function with a scree plot.

R
library(ggplot2)

# Proportion of variance explained by each principal component
scree_data <- data.frame(
  Component = 1:length(pca_result$sdev),
  Variance = pca_result$sdev^2 / sum(pca_result$sdev^2)
)

ggplot(scree_data, aes(x = Component, y = Variance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_line() +
  geom_point() +
  xlab("Principal Components") +
  ylab("Proportion of Variance Explained") +
  ggtitle("Scree Plot")

Output:


[Scree plot: proportion of variance explained by each principal component]


This plot shows the proportion of variance explained by each principal component. It helps in determining how many components should be considered for further analysis.
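A common (heuristic) way to choose the number of components is to keep enough of them to cover, say, 95% of the total variance; this can be read off the cumulative proportions computed below from the pca_result object.

R
# Cumulative proportion of variance explained
cum_var <- cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)
cum_var
# Smallest number of components that reaches 95% of the variance
which(cum_var >= 0.95)[1]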

Practical Applications of PCA

  1. Dimensionality Reduction: Reduce the number of variables while keeping the components that carry most of the information.
  2. Data Visualization: Project high-dimensional data into a two- or three-dimensional display so that its structure becomes easier to see (see the biplot sketch after this list).
  3. Feature Engineering: Derive new, uncorrelated features that can improve machine learning model performance.
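As a quick sketch of the visualization use case, base R's biplot can display the observations and variable loadings of the pca_result object computed above in the plane of the first two components:

R
# Observations and variable loadings in the space of PC1 and PC2
biplot(pca_result)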

Conclusion

R provides the PCA technique via prcomp, which is very useful for data analysis and machine learning. With it, practitioners can apply dimensionality reduction to convert a complex dataset into a simpler one from which valuable information can be extracted. PCA is applied across a wide range of domains and remains an essential part of many data science workflows.

