Multivariate Analysis in R
Multivariate analysis refers to the statistical techniques used to analyze data sets with multiple variables. It helps uncover relationships, reduce complexity and interpret underlying structures in data. These variables can be quantitative or categorical, and analyzing them together helps us understand complex relationships within the data. In this article, we will explore some common multivariate analysis methods in the R programming language.
Multivariate Analysis Techniques
Some of the multivariate analysis methods in R that are most frequently used are as follows:
- Principal Component Analysis (PCA): a dimensionality-reduction technique that transforms correlated variables into a smaller set of uncorrelated components, helping you identify the dataset's most important sources of variation and view the data in fewer dimensions.
- Factor Analysis (FA): a statistical method for identifying hidden latent variables that explain the pattern of correlations among the observed variables.
- Cluster Analysis: an unsupervised learning technique that groups similar observations into clusters based on their characteristics or distance measures (see the sketch after this list).
- Discriminant Analysis: a classification technique that separates observations into predefined groups by identifying the variables that best distinguish between them.
- Canonical Correlation Analysis (CCA): a multivariate method for exploring the relationship between two sets of variables by finding linear combinations of each set that are maximally correlated.
- Multidimensional Scaling (MDS): a technique for visualizing the similarity or dissimilarity between observations in a lower-dimensional space, typically starting from a distance matrix (also covered in the sketch below).
- Correspondence Analysis (CA): an exploratory technique for visualizing the relationships between categorical variables in a contingency table.
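Most of these techniques are available in base R or well-known packages. As a quick illustration of two of them, here is a minimal sketch that runs k-means cluster analysis and classical MDS on the numeric columns of the built-in iris data set; three clusters and two dimensions are chosen purely for illustration.
R
# Cluster analysis: k-means on the scaled numeric columns of iris
X <- scale(iris[, 1:4])
set.seed(123)                    # fixed seed so the clusters are reproducible
km <- kmeans(X, centers = 3)     # 3 clusters, chosen for illustration
table(km$cluster, iris$Species)  # compare clusters against the true species

# Multidimensional scaling: embed the pairwise distances in 2 dimensions
mds <- cmdscale(dist(X), k = 2)
plot(mds, col = km$cluster, xlab = "Dimension 1", ylab = "Dimension 2")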
The following example shows how to perform PCA on the built-in iris data set in R:
R
data(iris)

# Keep only the four numeric measurement columns
vars <- c("Sepal.Length", "Sepal.Width",
          "Petal.Length", "Petal.Width")
data_subset <- iris[, vars]

# Standardize the variables (scale. = TRUE below would also do this,
# so pre-scaling here is redundant but harmless)
data_scaled <- scale(data_subset)

pca <- prcomp(data_scaled, center = TRUE, scale. = TRUE)
summary(pca)
Output:
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
This output summarizes the PCA results: the standard deviation, proportion of variance and cumulative proportion for each principal component. The cumulative proportion shows that the first three components account for more than 99% of the total variance, so the data can be reduced to three dimensions with almost no loss of information.
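Beyond summary(), the variance breakdown can be inspected directly. This short sketch draws a scree plot and previews the component scores stored in the pca object created above:
R
# Scree plot: variance captured by each principal component
screeplot(pca, type = "lines", main = "Scree Plot")

# Scores of the first few observations on the first two components
head(pca$x[, 1:2])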
1. Different Visualizations for the Dataset
Visualizing the data helps us understand the connections between variables and spot patterns or trends. We can use the ggplot2 library to construct several plot types in R, including scatter plots, box plots and histograms.
R
install.packages("ggplot2")
library(ggplot2)
# Simulated data: two numeric variables and a group label
# (rnorm() is not seeded, so the exact numbers below will vary between runs)
data <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  group = sample(1:4, 100, replace = TRUE)
)

# Scatter plot of var1 against var2
ggplot(data, aes(x = var1, y = var2)) +
  geom_point()
Output:
Scatterplot
R
# Box plot of var1 within each group
ggplot(data, aes(x = factor(group), y = var1)) +
  geom_boxplot()
Output:
Boxplot
R
# Histogram of var1
ggplot(data, aes(x = var1)) +
  geom_histogram()
Output:
Histogram using ggplot2
A correlation matrix plot can also be made using the corrplot() function from the corrplot package.
R
install.packages("corrplot")
library(corrplot)
corrplot(cor(data), method = "circle")
Output:
Correlation plot using corrplot package in R
2. Descriptive Statistical Measures
In multivariate analysis, variance, covariance and correlation are crucial measures because they quantify the relationships between variables. R has built-in functions to compute each of them.
R
var(data$var1)               # variance of var1
cov(data$var1, data$var2)    # covariance between var1 and var2
cor(data$var1, data$var2)    # correlation between var1 and var2
Output:
0.964993019401173
-0.131206113335423
-0.133108806509815
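Passed a whole data frame instead of individual columns, the same functions return the full covariance or correlation matrix in a single call:
R
# Covariance and correlation matrices across all columns of data
cov(data)
cor(data)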
The moments library provides skewness and kurtosis, while the psych library offers factor analysis and other multivariate measures.
R
install.packages("moments")
install.packages("psych")
library(moments)
library(psych)
skewness(data$var1)   # asymmetry of the distribution of var1
kurtosis(data$var1)   # tailedness of the distribution of var1
Output:
-0.113671043634579
2.58907790883746
R
# Factor analysis on all three columns (one factor extracted by default)
fa(data)
Output:
Factor Analysis using method = minres
Call: fa(r = data)
Standardized loadings (pattern matrix) based upon correlation matrix
        MR1     h2     u2 com
var1   1.00 0.9957 0.0043   1
var2  -0.13 0.0171 0.9829   1
group -0.08 0.0062 0.9938   1

                MR1
SS loadings    1.02
Proportion Var 0.34
Mean item complexity = 1
Test of the hypothesis that 1 factor is sufficient.
df null model = 3 with the objective function = 0.03 with Chi Square = 2.53
df of the model are 0 and the objective function was 0
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is NA
The harmonic n.obs is 100 with the empirical chi square 0.23 with prob < NA
The total n.obs was 100 with Likelihood Chi Square = 0.12 with prob < NA
Tucker Lewis Index of factoring reliability = Inf
Fit based upon off diagonal values = 0.95
Measures of factor score adequacy
                                                 MR1
Correlation of (regression) scores with factors 1.00
Multiple R square of scores with factors        1.00
Minimum correlation of possible factor scores   0.99
3. PCA and LDA
Two well-known methods for multivariate analysis are PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis): PCA is used for dimensionality reduction, while LDA is used for classification.
- The prcomp() function from the stats package returns the dataset's principal components, their variances and the proportion of total variance each one accounts for.
- The lda() function from the MASS library fits a linear discriminant model, giving the group means and the coefficients of the linear discriminants, which can then be used to classify observations (see the prediction sketch after the output below).
R
install.packages("MASS")
install.packages("stats")
library(stats)
library(MASS)
pca <- prcomp(data[, 1:3])
summary(pca)
lda <- lda(group ~ var1 + var2, data = data)
summary(lda)
Output:
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.0946 1.0498 0.9119
Proportion of Variance 0.3826 0.3519 0.2655
Cumulative Proportion  0.3826 0.7345 1.0000
        Length Class  Mode
prior   4      -none- numeric
counts  4      -none- numeric
means   8      -none- numeric
scaling 4      -none- numeric
lev     4      -none- character
svd     2      -none- numeric
N       1      -none- numeric
call    3      -none- call
terms   3      terms  call
xlevels 0      -none- list
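Note that summary() on an lda object only lists the components of the fitted model; printing the object itself shows the prior probabilities, group means and discriminant coefficients. To judge classification accuracy we can call predict() on the fitted model, as in this minimal sketch (since group was assigned at random here, accuracy should be close to the 25% chance level):
R
# Classify the training observations with the fitted LDA model
pred <- predict(lda, data)

# Confusion matrix of predicted vs. actual group labels
table(Predicted = pred$class, Actual = data$group)

# Overall classification accuracy on the training data
mean(pred$class == data$group)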
With these methods, we can carry out a wide range of multivariate analyses in R.