
Data Analysis

Jocelyn DONZE (Université de Strasbourg)


[email protected]
Contents
1. What is data analysis ?
2. Principal Component Analysis
3. Exploratory Factor Analysis
4. Clustering methods
1) What is data analysis ?
• It is a technique to deal with large data-sets (many
variables/individuals), summarize information, and identify
the structure of data.
• A data-set containing m variables is condensed into a reduced
set of p < m new variables.
• These new variables are called principal components (or
sometimes factors) that
• are uncorrelated with each other and
• explain a large proportion of the original variability.
What is data analysis ?
• Data reduction is a necessary step to simplify data treatment
when the number of variables is very large.
• The description of complex interrelations between the
original variables is made easier by looking at the extracted
components.
• The price to pay for simplification is the loss of some of the
original information, because of the data reduction.
What is data analysis ?
• The higher the correlation across the original variables, the
smaller the number of components or factors needed to
adequately describe the phenomenon.
What is data analysis ?
Contents :
• Principal Component Analysis (PCA).
• Exploratory Factor Analysis (EFA).
• Clustering.
2) Principal Component Analysis
• Principal Component Analysis (PCA) is a dimension-reduction
tool that can be used to reduce a large set of quantitative
variables to a small set that still contains most of the
information of the large set.

• To do so, PCA transforms a number of m (possibly) correlated
variables into a (smaller) number of uncorrelated variables
called principal components (PCs).
Principal Component Analysis
• The dataset variables are usually standardized before analysis.
• Why? to compare variables expressed in different units.
• Standardization of variable $x$: for any observation $i$,
$$z_i = \frac{x_i - \bar{x}}{s}$$
where $\bar{x}$ is the mean of $x$ and $s = \sqrt{\dfrac{\sum_i (x_i - \bar{x})^2}{n-1}}$ is the sample
standard deviation of $x$.
• The mean of z is 0. Its sample variance is 1.
• Standardization puts all variables on the same scale.
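A minimal Stata sketch of this step, for a hypothetical variable named x (the name is only an illustration):

* standardize x to mean 0 and variance 1 using the egen std() function
egen zx = std(x)
* check: the mean of zx is 0 and its sample standard deviation is 1
summarize zx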
Construction of Principal Components
• The first principal component (PC1) accounts for as much of
the variability in the data as possible.

• Intuition for two variables (m = 2). Let us take the following
dataset with two variables, x and y :
Construction of Principal Components
• We want to identify the first principal component (PC1) that
explains the highest amount of variance.

• Graphically, we draw a line that splits the oval lengthwise.
Construction of Principal Components
How do we determine the second principal component?

• For our two-dimensional dataset, there are only two
principal components.
• The second principal component must be orthogonal to the
first principal component (why? because PCs are
uncorrelated).
Construction of Principal Components
How do we determine the second principal component?

• The second principal component captures the variance in the
data that is not captured by the first principal component.
Construction of Principal Components
• In fact PCA transforms the data into a new coordinate system
(in red) by rotation:
Reduction of Dimension
• Let's say we want to use PCA to reduce our two-dimensional
dataset to a one-dimensional dataset.
• We collapse our dataset, expressed with the new axes PC1 and
PC2, onto a single line (“projection”). The coordinates constitute
the summarized data.
Principal Component Analysis
• We have destroyed some of the original information when
we went from the two-dimensional dataset to the
one-dimensional projection.
• Although we lost some information in the transformation,
we did keep the most important axis, which incorporates
information from both x and y.
Principal Component Analysis

• With m variables, PCA is like rotating the orthogonal axes
(centered on the m means) so that
• the new first axis (the first PC) is associated with the maximum
variance,
• the new second axis (the second PC) is orthogonal to the first axis
and is associated with the maximum remaining variance,
• the new third axis (the third PC) is orthogonal to the first and
second axes and is associated with the maximum remaining variance,
• etc.
Principal Component Analysis
• Maximizing the variance on the
first principal component is in
fact equivalent to minimizing the
distances between the
observations and their
projections on this first principal
component.
• The point of rotation « o » is located at the average value of
the variables.
Principal Component Analysis

• Eigenvalues are numbers telling you how much variance
there is in the different principal components.
• In the initial dataset with m (standardized) variables, the
total variance is m × 1 = m.
• The sum of all eigenvalues is also equal to m. In the example
with x and y (m = 2), we could have:

                                   Variance of 1st variable   Variance of 2nd variable   Total variance
Initial variables (standardized)              1                          1                      2
Transformed variables (PCs)                  1.5                        0.5                     2
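A quick Stata sketch of this property (it uses Stata's bundled auto dataset rather than the course data):

* PCA on three variables (pca uses the correlation matrix by default, so total variance = 3)
sysuse auto, clear
pca price mpg weight
* the eigenvalues, stored by pca in e(Ev), sum to m = 3
matrix list e(Ev)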
Principal Component Analysis - Mathematics
• The new variables (principal components) $PC_k$ are linear
combinations of the original variables (the x's). The kth PC is
$$PC_k = a_{k1} x_1 + a_{k2} x_2 + \cdots + a_{km} x_m, \qquad k = 1, \ldots, m$$
• The $PC_k$ are derived in decreasing order of importance (that is,
of variance).
• The $a_{k1}, a_{k2}, \ldots, a_{km}$ constitute the eigenvector associated with
component k, and we have
$$a_{k1}^2 + a_{k2}^2 + \cdots + a_{km}^2 = 1$$
How many components to keep ?
• We want to reduce the number of variables to p < m principal
components (m: initial number of variables) but also keep
sufficient information (variance).
• Different criteria :
• Take enough PCs to reach a sufficient cumulative share of explained variance, say 60-70%.
• Scree plot: represents the ability of the PCs to explain the variation in the data (see the sketch after this list).
• « Kaiser criterion » : keep PCs with eigenvalues > 1.
• Or retain as many axes as we can interpret !
Explanation of the Kaiser criterion
• With the Kaiser criterion, we add one extra PC as long as the variance
added by doing so is larger than the variance of each initial variable (= 1).
• Idea for the criterion: a factor must explain at least as much
variance as a single variable.
Remark
• To perform the PCA, software uses the covariance or the correlation
matrix.
• You can even perform a PCA with a correlation matrix only (that is,
without the data)!
• Using the correlation matrix instead of the covariance matrix is the
same thing as standardizing the initial variables, because the
correlation measure is standardized (see the Stata options sketched below).
• In this case, there is no need to standardize your data before the analysis.
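A sketch of both routes in Stata; the variable names and the correlation matrix below are invented for illustration:

* PCA from raw data: correlation matrix (the default) or covariance matrix
pca x y z
pca x y z, covariance
* PCA from a correlation matrix alone, without any data in memory
matrix C = (1, 0.6, 0.4 \ 0.6, 1, 0.5 \ 0.4, 0.5, 1)
matrix rownames C = x y z
matrix colnames C = x y z
pcamat C, n(50)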
Example : Crime data in the American states
• Individuals : 50 states.
• Variables :
• Murder,
• Rape,
• Assault,
• Burglary: the act of entering another's premises without authorization
in order to commit a crime, such as theft.
• Larceny: the unlawful removal of another's personal property with the
intent of permanently depriving the owner (theft).
• Autotheft.
STATE MURDER RAPE ROBBE ASSAU BURGLA LARCEN AUTO
Alabama 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
Alaska 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
Arizona 9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas 8.8 27.6 83.2 203.4 972.6 1862.1 183.4
California 11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado 6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut 4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware 6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida 10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia 11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii 7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
Idaho 5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
Illinois 9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana 7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa 2.3 10.6 41.2 89.8 812.5 2685.1 219.9
Kansas 6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky 10.1 19.1 81.1 123.3 872.2 1662.1 245.4
Louisiana 15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine 2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
Maryland 8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts 3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan 9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota 2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
Mississippi 14.3 19.6 65.7 189.1 915.6 1239.9 144.4
Missouri 9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana 5.4 16.7 39.2 156.8 804.9 2773.2 309.2
Nebraska 3.9 18.1 64.7 112.7 760.0 2316.1 249.1
Nevada 15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire 3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
New Jersey 5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
Crime data (cont.)
New Mexico 8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York 10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina 10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
North Dakota 0.9 9.0 13.3 43.8 446.1 1843.0 144.7
Ohio 7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma 8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
Oregon 4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania 5.6 19.0 130.3 128.0 877.5 1624.1 333.2
Rhode Island 3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota 2.0 13.5 17.9 155.7 570.5 1704.4 147.5
Tennessee 10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas 13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah 3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
Vermont 1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
Virginia 9.0 23.3 92.1 165.7 986.2 2521.2 226.7
Washington 4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia 6.0 13.2 42.2 90.9 597.4 1341.7 163.3
Wisconsin 2.8 12.9 52.2 63.7 846.9 2614.2 220.7
Wyoming 5.4 21.9 39.7 173.9 811.6 2772.2 282.0
PCA: Example. Data editor on Stata
• Let us open the file by copying and pasting from Excel into the data
editor (or open the Stata file).
PCA: Example. Command in Stata
• To perform the PCA, do the following:
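The slide shows a screenshot of the menus; the equivalent command line, with the variable names used in this dataset, is simply:

* PCA on the seven crime variables
pca murder rape robbe assau burgla larcen auto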
PCA: Example, Eigenvalues and Principal Components
PCA: Example. Screeplot of Eigenvalues
• Here we can take the first two principal components.
• 76% of the variance explained.
• Eigenvalue of PC3 is less than one (Kaiser criterion).
PCA: Example. Interpretation of PCs
• PC1 = 0.300 * murder + 0.432 * rape + 0.397 * …

• Do the components mean anything?
• PC1 has “high” positive values on all crimes, and can therefore be interpreted
as a global measure of criminality and violence in the state.
• PC2 has “high” positive values on AUTO, LARCENY and ROBBERY and “high”
negative values on MURDER, ASSAULT and RAPE. It is interpreted as a
measure of the preponderance of property crime over violent crime.
PCA: Example. Interpretation of PCs
• The interpretation is confirmed by looking at the correlations (the « loadings »)
between the initial variables and the PCs. The command is sketched below.
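The original slide shows a screenshot; after pca, a minimal way to display the loadings is:

* display the loadings of the retained components
estat loadings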

• Negative correlation between PC2 and murder, but positive with autotheft/larceny.
PCA: Example.
• We draw the individuals (states) with their new coordinates.
PCA: Example
• The problem is that the individuals (the states) are not labelled.
• We can use another technique to save the new coordinates.
PCA: Example
• To get the coordinates of the individuals in the new system of principal
components, type « PC1 PC2 PC3 » in the « new variable names » area and
click « OK ». PC1, PC2 and PC3 appear on the right as new variables.
• Then, under Stata, menu Graphics > Twoway graph > Create > Basic plots >
Scatter.
• X variable = PC1 and Y variable = PC2.
• Add labels to markers: marker property = State.
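The same steps from the command line (a sketch; « state » is assumed to be the name of the variable holding the state names):

* save the coordinates of the individuals on the first three PCs
predict PC1 PC2 PC3, score
* plot the states in the (PC1, PC2) plane, labelled by name
scatter PC2 PC1, mlabel(state)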
PCA – Example 5 students and 3 grades

Student   Gfinance   Gmarketing   Gpolicy
1         3          6            5
2         7          3            3
3         10         9            8
4         3          9            7
5         10         6            5
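A sketch of how to reproduce this small example in Stata, with variable names taken from the table:

* enter the grades and run the PCA
clear
input student gfinance gmarketing gpolicy
1 3 6 5
2 7 3 3
3 10 9 8
4 3 9 7
5 10 6 5
end
pca gfinance gmarketing gpolicy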
PCA – Example 5 students and 3 grades
3) Exploratory Factor Analysis (EFA)
• Exploratory Factor Analysis is a correlational method used to
find and describe the underlying factors driving the data for a
large set of variables.
• We sometimes talk of latent factors (latent = not observed).
• Exploratory Factor Analysis identifies correlations between
and among variables to bind them into one (or several)
underlying factor(s) driving their values.
Exploratory Factor Analysis
• With EFA, you do not have a pre-defined idea of the structure or of how
many dimensions are in a set of variables.
• In the next chapter, we will talk of Confirmatory Factor Analysis (CFA).
• With CFA we want to test a specific hypothesis about the structure or the
number of dimensions underlying a set of variables.
• For example, in your data you may think there are two dimensions and you
want to verify that.


Exploratory Factor Analysis - example
V1   V2   V3   V4   V5    V6
7    -2   8    10   -23   -5
5    6    6    6    -19   4
9    -3   8    13   5     0
12   13   11   16   23    11
2    11   3    -1   101   8
7    -8   7    9    14    -15
6    21   5    5    29    18
4    5    3    8    6     26
1    3    2    3    5     13
5    7    4    7    2     1
Exploratory Factor Analysis - example
By looking at the table with the set of variables V1, V2, V3, V4,
V5, and V6

• One can see that variables V1, V3 and V4 look similar. They
might be related to the same factor.
• Variables V2, and V6 look similar. They might be related to
the same factor.
• V5 might be a factor on its own.
Exploratory Factor Analysis - Example

• This data set is explained by 3 factors rather than by 6 variables.
Exploratory Factor Analysis
We need to
• Determine the assumptions for factor analysis
• Develop a way of identifying factors.
• Determine if a factor is important or not.
• Examine the relation of the variables to the factor.
Exploratory Factor Analysis - Assumptions
• Quantitative data.
• No outliers in the data set.
• Adequate sample size (a large data set).
• No perfect multicollinearity. Each variable is unique.
• Variables do not need to have the same variance.
• Linearity: relationships between the variables must be linear.
Exploratory Factor Analysis - Model

• The model is the following: we explain the m initial variables
by p factors (latent variables: they are not observed).
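The equation itself is lost with the slide image; the standard EFA model being described is, for each initial variable $x_l$ ($l = 1, \ldots, m$):

$$x_l = \lambda_{l1} F_1 + \lambda_{l2} F_2 + \cdots + \lambda_{lp} F_p + e_l$$

where the $F_k$ are the p common factors and $e_l$ is an error term specific to $x_l$.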
Exploratory Factor Analysis - Model

• The s are called the loadings.

• Factor loadings represent how much a factor explains a


variable in factor analysis (correlation). They belong to [0,1].

• For any l in {1,m} and k in {1,p}, the loading lk represents


the correlation between variable xl and factor Fk.
Exploratory Factor Analysis - Assumptions

• Assumptions of the model:
• For any k in {1, …, p}, the unobservable factors $F_k$ are independent of
one another and of the errors.
• For any l in {1, …, m}, the error terms are such that $E(e_l) = 0$ and
$Var(e_l) = \sigma_l^2$.
Exploratory Factor Analysis. Uniqueness and communality
• The uniqueness of the variable $x_l$ belongs to [0, 1] and is a measure
of what is not explained by the common factors $F_1, \ldots, F_p$.
• The communality of the variable $x_l$ belongs to [0, 1] and is a
measure of what is explained by the common factors $F_1, \ldots, F_p$.
• Uniqueness + Communality = 1.
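Explicitly, assuming as above that the factors are independent and standardized, and that the variables are standardized ($Var(x_l) = 1$), the variance of $x_l$ decomposes as

$$1 = Var(x_l) = \underbrace{\lambda_{l1}^2 + \cdots + \lambda_{lp}^2}_{\text{communality}} + \underbrace{\sigma_l^2}_{\text{uniqueness}}$$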
Exploratory Factor Analysis - Example
Grades of 5 MBA students
Student Gfinance Gmarketing Gpolicy
1 3 6 5
2 7 3 3
3 10 9 8
4 3 9 7
5 10 6 5
Exploratory Factor Analysis - Example
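The output screenshot is not reproduced here; a sketch of the corresponding command, with variable names taken from the table above:

* extract two factors from the three grades
factor gfinance gmarketing gpolicy, factors(2)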
Exploratory Factor Analysis - Example
• We retain the number of factors by looking at the eigenvalues (same
procedure as in PCA). Here we can retain two factors.
• Here
• Gfinance = 0.0299 × F1 + 0.995 × F2 + e1
• Gmarketing = 0.9941 × F1 - 0.0815 × F2 + e2
• Gpolicy = 0.9961 × F1 + 0.0514 × F2 + e3
• The factors are clearly defined.
• Factor 1 can be interpreted as « verbal skills » (corr. of 0.9941 with
Gmarketing ; corr. of 0.9961 with Gpolicy).
• Factor 2 can be interpreted as « quantitative skills » (corr. of 0.995 with
Gfinance).
Exploratory Factor Analysis - Example
• Possible interpretation: In this MBA program, Finance is highly
quantitative, while marketing and policy have a strong qualitative
orientation.
• Quantitative skills (F2) should help a student in finance, but not in
marketing or policy. Verbal skills (F1) should be helpful in
marketing or policy but not in finance.
Effect of factor on   F1 (verbal skills)   F2 (quantitative skills)
Gfinance                     .                       +
Gmarketing                   +                       .
Gpolicy                      +                       .
Exploratory Factor Analysis - Example 2
• The school system of a major city wanted to determine the
characteristics of a great teacher, and so they asked 120
students to rate the importance of each of the following 9
criteria using a Likert scale of 1 to 10.
• 10: the characteristic is extremely important
• 1 : the characteristic is not important at all.
Exploratory Factor Analysis - Example 2
The measured characteristics :
• Setting high expectations for the students
• Entertaining
• Able to communicate effectively
• Having expertise in their subject
• Able to motivate
• Caring
• Charismatic
• Having a passion for teaching
• Friendly and easy-going
Exploratory Factor Analysis - Example 2
• Here we take four factors (Kaiser criterion).
• The first factor measures charisma, enthusiasm and the ability to communicate.
• The second factor measures the « strict » (high expectations) vs « friendly »
teacher.
• The third factor measures the (lack of) caring or attention of the teacher.
• The fourth factor measures the teacher’s expertise in the subject.
• As a rule of thumb, one considers that a factor loading affects a
variable when its absolute value is above 0.4.
Exploratory Factor Analysis - example 2
• When the retained factors are not easy to interpret, one can perform a
varimax rotation of the factors.
• Varimax is an orthogonal rotation method that tends to produce factor
loadings that are either very high or very low, making it easier to match each
item with a single factor.
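A sketch of the corresponding Stata commands, with hypothetical variable names v1-v9 standing for the nine criteria:

* extract four factors, then apply a varimax (orthogonal) rotation
factor v1-v9, factors(4)
rotate, varimax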
4) Cluster Analysis
• Several Individuals – several quantitative variables
• Cluster analysis is a class of techniques used to classify individuals
into groups that are
• relatively homogeneous within themselves and
• heterogeneous between each other
• Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on
the basis of a defined set of variables
• These groups are called clusters
Cluster Analysis - Individuals
• Individuals can be
• Consumers (« market segmentation »),
• Products.
• Countries.
• Etc …
Cluster Analysis – Measuring Similarity
• To measure similarity between two observations, a distance measure is needed.
• The best-known measure of distance is the Euclidean distance,
• the concept we use in everyday life for spatial coordinates.
• Euclidean distance:
$$D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$$
where $D_{ij}$ is the distance between individuals i and j, and $x_{kj}$ is the value of
variable $x_k$ for individual j.
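A small worked example with two variables and invented coordinates: the distance between the observations $(1, 2)$ and $(4, 6)$ is
$$D = \sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = 5.$$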

Potential Problems
• Different measures (scales) = different weights.
• Correlation between variables (double counting).
• Solution (if needed) : standardization and principal component analysis.
Cluster Analysis – Clustering methods
Two types of clustering methods :
• Hierarchical procedures
• Agglomerative (start from n clusters to get to 1 cluster)
• Divisive (start from 1 cluster to get to n clusters)
• Non-hierarchical procedures (the ones we will focus on)
• K-means clustering
• ...
Cluster Analysis – Hierarchical procedure
• Agglomerative:
• Each of the n observations constitutes a separate cluster
• The two closest clusters are aggregated, so that in step 1 there are n-1 clusters
• In the 2nd step another cluster is formed (n-2 clusters), by nesting the two closest clusters,
and so on
• There is a merging in each step until all observations end up in a single cluster in the final
step.

• Divisive :
• All observations are initially assumed to belong to a single cluster
• The most dissimilar observation is extracted to form a separate cluster
• In step 1 there will be 2 clusters, in the second step three clusters and so on, until the final
step will produce as many clusters as the number of observations.
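A Stata sketch of the agglomerative route on the crime data; Ward's linkage is one common choice, and the dendrogram visualizes the successive merges:

* agglomerative hierarchical clustering with Ward's linkage
cluster wardslinkage murder rape robbe assau burgla larcen auto
* display the dendrogram of the merging sequence
cluster dendrogram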
Cluster Analysis. Non-hierarchical procedure: K-means
• Knowledge of the number of clusters (k)
is required
• First, initial cluster centres (the seeds) are
determined for each of the k clusters
(usually random choices -> triangles)
• Each iteration allocates observations to
each of the k clusters, based on their
distance from the cluster centres
• Cluster centres are computed again and
observations may be reallocated to the
nearest cluster in the next iteration (new
triangle)
• When no observations can be
reallocated, the process stops.
Example « Datacrime » Database
• Stata code for k = 2 :

cluster kmeans murder rape robbe assau burgla larcen auto, k(2) measure(L2) start(krandom)

• Stata creates a new variable, _clus_XX, which takes values 1 and 2 because
k = 2.
• Let us do the same for k = 3, k = 4, k = 5. Other new variables are created.
How to choose k ?
• It can be based on the researcher’s knowledge or the marketer’s decision. Or,
more statistically, on the “pseudo-F”.
• The pseudo-F statistic describes the ratio of between-cluster variance to
within-cluster variance.
• Between-cluster variance : dissimilarity between clusters.
• Within-cluster variance : heterogeneity inside clusters.
• The ratio must be high to have a strong dissimilarity between clusters
and a weak heterogeneity inside clusters.
How to choose k ?
Stata command if k = 2 :

cluster stop _clus_1, rule(calinski)

k = 2, F = 61.68
k = 3, F = 65.27 <- the best
k = 4, F = 64.29 <- good too
k = 5, F = 53.04
Then F is decreasing.
How to choose k ?
• k = 3 is the best from a statistical point of view but the researcher has
the last word !
• k = 4 is good too. (Easier interpretation ?)
Interpretation
• To help the interpretation of the clusters, one can do a PCA. Commands:

pca murder-auto
predict PC1 PC2, score
by _clus_2, sort : summarize PC1
by _clus_2, sort : summarize PC2

Group 1 contains states with low levels of crime, but rather of the violent type.
Group 2 contains states with high levels of crime, but rather of the property type.
Group 3 contains states with intermediate levels of crime of all types.
