ACP using R
Sami Mestiri
Email: mestirisami2007@gmail.com
Faculté des sciences économiques et de gestion de Mahdia,
Université de Monastir, Tunisia.
Introduction
Conventional statistical techniques such as correlation, regression and scatter plots only allow us to study the relationship between two variables.
The problem: real data sets often involve a large number of variables.
Statistical methods are used in many fields, and their development has been driven by the growth of computing power.
A phenomenon is often best apprehended through the combination of a number of variables.
How can we account for all of this information at once?
Definition
Principal component analysis (PCA) is performed in order to simplify the description of a set of interrelated quantitative variables X1, ..., Xp. The technique can be summarized as a method of transforming the original variables into new, uncorrelated variables.
The new variables are called the principal components. Each principal component is a linear combination of the original variables.
The goal of PCA is to find the best low-dimensional representation of the variation in a multivariate data set. It is thus one of the classical methods of dimension reduction: it reduces the number of variables/dimensions without losing much of the information.
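To make this concrete, here is a minimal sketch (not taken from these slides) using base R's prcomp function on simulated data; it illustrates that the resulting components are uncorrelated:

# PCA on two correlated simulated variables with base R's prcomp
set.seed(1)
x1 = rnorm(100)
x2 = 0.8 * x1 + rnorm(100, sd = 0.3)   # x2 built to be correlated with x1
pca = prcomp(cbind(x1, x2), scale. = TRUE)
round(cor(pca$x), 2)                   # correlations between the components: the identity matrix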
Data Table
The data are arranged in a table with L rows (the observations) and P columns (the variables):

        1   ...   p   ...   P
   1
   .
   i    ...   Xip   ...
   .
   L

Xip = value of variable p for observation i
L = number of observations (studied through their similarity)
P = number of variables (studied through their correlations)
Principal component
For a given set of data vectors x_i, i = 1, ..., p, the d principal axes are those orthonormal axes onto which the variance retained under projection is maximal.
In order to capture as much of the variability as possible, let us choose the first principal component, denoted by U_1, to have maximum variance.
Suppose that all centered observations are stacked into the columns of an n × p matrix X, where each column corresponds to an n-dimensional observation and there are p observations. Let the first principal component be a linear combination defined by the coefficients (or weights) w = [w_1, ..., w_p]:

U_1 = w_1 x_1 + w_2 x_2 + ... + w_p x_p

In matrix form:

U_1 = w^T X
We want this first dimension to have maximum variance:

var(U_1) = var(w^T X) = w^T S w

where S = cov(X) is the sample covariance matrix of X.
Clearly var(U_1) can be made arbitrarily large by increasing the magnitude of w, so the variance above has no upper bound and we cannot find the maximum. To solve this problem, we choose w to maximize w^T S w while constraining w to have unit length. We can therefore rewrite the problem as:

max w^T S w   subject to   w^T w = 1

To solve this optimization problem a Lagrange multiplier α is introduced:

L(w, α) = w^T S w − α (w^T w − 1)     (1)

Differentiating (1) with respect to w gives p equations:

S w = α w
Premultiplying both sides by w^T, we have:

w^T S w = α w^T w = α

so var(U_1) is maximized if α is the largest eigenvalue of S. Clearly α and w are an eigenvalue and an eigenvector of S.
Differentiating (1) with respect to the Lagrange multiplier α gives back the constraint:

w^T w = 1

This shows that the first principal component is given by the normalized eigenvector associated with the largest eigenvalue of the sample covariance matrix S. A similar argument shows that the d dominant eigenvectors of the covariance matrix S determine the first d principal components.
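As a quick numerical check of this result (our sketch, base R only): the variance of the data projected onto the leading eigenvector of S equals the largest eigenvalue.

# Check: var(X w1) equals the largest eigenvalue of S = cov(X)
set.seed(2)
X = scale(matrix(rnorm(200), ncol = 4), center = TRUE, scale = FALSE)  # 50 centered observations
S = cov(X)
e = eigen(S)            # eigenvalues are returned in decreasing order
w1 = e$vectors[, 1]     # first principal axis (unit norm)
var(X %*% w1)           # equals e$values[1]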
Analysis guide for PCA
Step 1: Select the axes and planes to retain, based on the eigenvalues.
Step 2: Project the variables and the individuals onto a given plane (F1, F2):
- Check the quality of representation (qlt) in the plane, to set aside poorly represented individuals.
- Examine the total contribution (ctr) to each axis, to give a meaning to that axis (oppositions, trends, ...).
- Examine the layout of the variables and individuals, looking for identifiable groups, oppositions and trends.
- Use subject-matter knowledge to propose an explanation of the results of the analysis.
Application
This data table contains the technical data of 62 vehicles, model year 1994. The variables are:
row.names : model name
Power : in fiscal HP
Cylinder : engine displacement, in cm3
Length : car length
Width : car width
Area : car surface area
Weight : total weight, in kg
Speed : maximum speed, in km/h
DepArret : time, in seconds, to cover 1000 m from a standing start
Conso : average consumption per 100 km, in liters (gasoline or diesel)
Loading the data:
autos.data = read.table("voiture1.txt", header = TRUE, sep = "")
We eliminate the 10 individuals with missing values:
autos.data = na.omit(autos.data)
The identifier of the individuals, row.names(autos.data), contains the names of the vehicles.
The weight that a variable carries in the computation of the principal components depends on its magnitude: a variable with a large standard deviation will carry more weight than a variable with a small standard deviation, so the variables with large deviations "build" the first components. The calculations are not wrong, but reading the results of such a PCA can become complicated.
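As a quick illustration (a sketch, not in the original), the standard deviations of the variables can be compared before deciding to standardize:

# Compare the magnitudes of the variables; large differences between
# them motivate a centered and scaled (standardized) PCA
round(apply(autos.data, 2, sd), 2)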
Histograms of all the variables:
layout(matrix(1:9, 3, 3))    # 3 x 3 grid of panels
for (i in 1:9) {
  hist(autos.data[, i], main = names(autos.data)[i], xlab = "")
}
layout(1)                    # back to a single panel
We advise using the FactoMineR library of R. The results of the analysis are stored in the object autos.acp:
library(FactoMineR)
autos.acp = PCA(autos.data, scale.unit = TRUE, graph = FALSE)
Choosing the type of analysis:
The scale.unit option of the PCA function is used to center and scale the variables.
To set the number of axes to study, we examine the eigenvalues. Each eigenvalue corresponds to the share of inertia projected onto a given axis and characterizes that axis by the percentage of inertia it explains. We therefore retain the axes with the highest eigenvalues.
The eigenvalues
The choice of the axes to retain is a little tricky. We can give some rules:
- Kaiser rule (standard PCA): we retain only the axes with an eigenvalue greater than 1 (= the initial inertia of a single variable).
- Minimum inertia rule: we retain the first axes needed to reach a given percentage of explained inertia (70%, for example).
- Elbow rule: there are often large eigenvalues at first, then small ones, with a drop in the diagram; the axes before the drop are retained (see the scree plot sketch below).
- Common sense rule: we analyse the planes and axes and retain only those that can be interpreted.
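The scree plot used by the elbow rule can be drawn as follows (a minimal sketch based on the autos.acp object defined earlier):

# Scree plot of the eigenvalues, to apply the elbow and Kaiser rules
barplot(autos.acp$eig[, 1], names.arg = 1:nrow(autos.acp$eig),
        xlab = "Axis", ylab = "Eigenvalue")
abline(h = 1, lty = 2)   # Kaiser threshold: eigenvalue = 1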
Printing the eigenvalues (the variances of the components):
val.propres = round(autos.acp$eig[,1], 2)
[1] 6.45 1.14 0.66 0.33 0.24 0.10 0.04 0.04 0.00
The cumulative variances:
variances.cumulates = round(cumsum(autos.acp$eig[,1]), 2)
[1] 6.45 7.59 8.25 8.58 8.82 8.92 8.96 9.00 9.00
The percentages of variance:
percentage.variance = round(autos.acp$eig[,2], 2)
[1] 71.70 12.64 7.30 3.69 2.63 1.11 0.48 0.43
Presentation of the results - the main plane
The results of the PCA are stored in the object autos.acp. The coordinates of the rows and of the columns are respectively in:
For the individuals:
autos.acp$ind$coord[,1:2]
For the variables:
autos.acp$var$cor[,1:2]
The principal components are constructed as linear combinations of the initial variables. To display the connections between the principal components and the initial variables, a standard PCA plots the variables in the factorial plane. The coordinates of the variables are the correlation coefficients of these variables with the principal components.
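This can be verified directly (a sketch; Power is one of the variables listed above):

# The coordinate of a variable equals its correlation with the component
cor(autos.data$Power, autos.acp$ind$coord[, 1])
autos.acp$var$cor["Power", 1]    # the two values should match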
Graphical representation - the variables plane
plot(autos.acp, choix = "var", title = "Correlation circle")
Interpretation - the variables plane
The rules for reading the circle of correlations:
- We only take into account the variables that are close to the circle of correlations; otherwise the variable is poorly correlated with the principal components and therefore poorly represented.
- The connection between well represented variables can be analysed through the direction and sense of their vectors:
+ if the vectors have the same direction and the same sense, the variables are positively correlated;
+ if the vectors have the same direction but opposite senses, the variables are negatively correlated;
+ if the vectors are perpendicular, the variables are uncorrelated.
Interpretation - the individuals plane
The individuals are associated with points in space whose coordinates are the variables. The distance between two individuals can be measured with the classical Euclidean distance between the two points.
The principal components are constructed so that the information lost on the distances between individuals is minimal when the individuals are projected onto the factorial plane (F1, F2). The distances observed between individuals in the factorial plane are therefore generally as close as possible to the actual distances between these individuals; this can be checked as shown below.
By analysing the factorial plane we can observe individuals that are close together or far apart. It is possible to build groups, observe trends, ...
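A minimal sketch (ours) to check how well the plane (F1, F2) preserves the distances between individuals:

# Compare pairwise distances computed on all standardized variables
# with distances computed in the first factorial plane
d.full  = dist(scale(autos.data))             # distances in the full space
d.plane = dist(autos.acp$ind$coord[, 1:2])    # distances in the plane (F1, F2)
cor(as.vector(d.full), as.vector(d.plane))    # close to 1: distances well preserved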
Graphical representation - the individuals plane
plot(autos.acp, choix = "ind", title = "Individuals - first plane")
Interpretation - the individuals plane (continued)
The rules for reading the factorial plane are:
- Only the well represented individuals are taken into account in the interpretation.
+ We compute the sum of the qlt values in the plane and check that this sum is not too small relative to the average quality of the map.
- We balance, on the positive and the negative side, the individuals that have the largest contribution to a given axis.
+ In parallel with the analysis of the variables, this gives a concrete meaning to the axes, in terms of oppositions between individuals and variables or of particular trends.
- Groups can be displayed, for example with the s.class function, either for pre-existing groups (gender, for example) or for groups built from the similarities between individuals (see the sketch below).
- When there are too many individuals, representative individuals can be used.
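A minimal sketch of such a group display, assuming the ade4 package is installed; the three groups built here by k-means are purely illustrative, not part of the original data:

# Display illustrative groups of individuals with ade4's s.class
library(ade4)
coords = as.data.frame(autos.acp$ind$coord[, 1:2])
groups = as.factor(kmeans(coords, centers = 3)$cluster)   # hypothetical grouping
s.class(coords, fac = groups)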
Contributions and qualities of representation
The contributions of the variables and of the individuals to the construction of the axes are obtained with the following commands:
For the rows (individuals):
ctr of the rows, in %:
inertieL = round(autos.acp$ind$contrib[,1:2], digits = 2)
qlt of the rows (cos2):
inertieL = round(autos.acp$ind$cos2[,1:2], digits = 2)
For the columns (variables):
ctr of the columns, in %:
inertieC = round(autos.acp$var$contrib[,1:2], digits = 2)
qlt of the columns (cos2):
inertieC = round(autos.acp$var$cos2[,1:2], digits = 2)
The contributions (ctr)
When a factorial axis is built, some variables and some individuals play a more important role than others. A parameter called the contribution (ctr) measures that influence: it is defined as the proportion of the inertia of the axis explained by the variable or the individual.
Rules of interpretation:
- The analysis is done one axis at a time, in parallel on the variables and the individuals.
- The larger the ctr, the greater the influence of the individual. We therefore retain the highest values (there is often a drop after a few values); see the sketch below.
- The ctr is counted as positive if the individual lies in the positive part of the axis.
- The ctr is counted as negative if the individual lies in the negative part of the axis.
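For example (a sketch), the individuals with the largest contributions to axis 1 can be listed as follows:

# Rank the individuals by their contribution (ctr) to axis 1, in %
ctr1 = autos.acp$ind$contrib[, 1]
top  = head(sort(ctr1, decreasing = TRUE))   # the few dominant contributions
top
sign(autos.acp$ind$coord[names(top), 1])     # side of the axis: positive or negative part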
The quality of representation (qlt)
The individuals shown in a factorial plane are not necessarily properly represented: if the angle with the plane is large (small cos2), the original point is far from its projection.
The parameter qlt is used to characterize the quality of representation on an axis. qlt corresponds to the ratio of the projected inertia of a point to its initial inertia.
- The closer qlt_i is to 1, the better the individual is represented.
- The closer qlt_i is to 0, the more poorly it is represented.
- In a plane, we compute the sum of the two qlt values, e.g. qlt_F1 + qlt_F2 for the plane (F1, F2); see the sketch below.
Overall quality: for a given plane, we also define the overall quality as the percentage of inertia explained by the plane. The qlt of an individual or of a variable is assessed by comparison with this overall quality.
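A minimal sketch (ours) computing the qlt of each individual in the plane (F1, F2) and flagging the most poorly represented ones:

# qlt in the plane (F1, F2): sum of the cos2 over the first two axes
qlt.plane = rowSums(autos.acp$ind$cos2[, 1:2])
round(head(sort(qlt.plane)), 2)    # the most poorly represented individuals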