Introduction To Data Analysis
• Idea for the criterion: a factor must explain at least the same amount of
variance as a single variable.
Remark
• To perform the PCA, software uses either the covariance matrix or the
correlation matrix.
• You can even perform a PCA with a correlation matrix only (that is,
without data)!
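As a minimal sketch of these points, the Python snippet below runs a PCA from a correlation matrix alone, with no underlying data. It assumes only two variables with a hypothetical correlation r = 0.6 (an illustrative value, not taken from the crime data), because the eigenvalues of a 2 x 2 correlation matrix have the closed form 1 + r and 1 - r. It then applies the criterion stated earlier: each standardized variable has variance 1, so a component is kept only if its eigenvalue is at least 1. The function names are made up for this example.

```python
def pca_from_corr_2x2(r):
    """Eigenvalues of the correlation matrix [[1, r], [r, 1]], largest first."""
    return sorted([1.0 + r, 1.0 - r], reverse=True)


def retained(eigenvalues, threshold=1.0):
    """Keep components explaining at least as much variance as one variable."""
    return [ev for ev in eigenvalues if ev >= threshold]


r = 0.6  # hypothetical correlation, for illustration only
eigenvalues = pca_from_corr_2x2(r)
# Share of the total variance carried by each component.
share = [ev / sum(eigenvalues) for ev in eigenvalues]
kept = retained(eigenvalues)

print("eigenvalues:", eigenvalues)
print("share of variance:", share)
print("components retained:", len(kept))
```

The eigenvalues always sum to the number of variables, which is why "explaining as much as a single variable" translates into the threshold of 1.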
• PC2 is negatively correlated with murder but positively correlated with
auto theft and larceny.
PCA: Example
• We draw the individuals (states) with their new coordinates.
PCA: Example
• The problem is that the individuals (the states) are not labelled.
• We can use another technique to save the new coordinates.
PCA: Example
• To get the coordinates of the individuals in the new system of principal
components, type « PC1 PC2 PC3 » in the « new variables names » area and
click « OK ». PC1, PC2 and PC3 then appear on the right as new variables.
• Then, in Stata, use the menu Graphics > Twoway graphs > Create > Basic plots >
Scatter.
• Set X variable = PC1 and Y variable = PC2.
• Add labels to the markers: marker property = State.
PCA – Example 5 students and 3 grades
Exploratory Factor Analysis - Example
V1 V2 V3 V4 V5 V6
7 -2 8 10 -23 -5
5 6 6 6 -19 4
9 -3 8 13 5 0
12 13 11 16 23 11
2 11 3 -1 101 8
7 -8 7 9 14 -15
6 21 5 5 29 18
4 5 3 8 6 26
1 3 2 3 5 13
5 7 4 7 2 1
Looking at the table of variables V1, V2, V3, V4, V5 and V6:
• One can see that variables V1, V3 and V4 look similar. They
might be related to the same factor.
• Variables V2 and V6 look similar. They might be related to
the same factor.
• V5 might be a factor on its own.
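A quick way to check these visual impressions is to compute the pairwise correlations. The sketch below assumes that the six columns of the numeric table above are V1 to V6, in that order, and uses a hand-written Pearson correlation so that it needs nothing beyond the Python standard library.

```python
from math import sqrt

# Columns of the table, one list per variable (assumed order: V1..V6).
data = {
    "V1": [7, 5, 9, 12, 2, 7, 6, 4, 1, 5],
    "V2": [-2, 6, -3, 13, 11, -8, 21, 5, 3, 7],
    "V3": [8, 6, 8, 11, 3, 7, 5, 3, 2, 4],
    "V4": [10, 6, 13, 16, -1, 9, 5, 8, 3, 7],
    "V5": [-23, -19, 5, 23, 101, 14, 29, 6, 5, 2],
    "V6": [-5, 4, 0, 11, 8, -15, 18, 26, 13, 1],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


names = list(data)
corr = {(a, b): pearson(data[a], data[b]) for a in names for b in names}

# Print the full correlation matrix, one row per variable.
print("    " + " ".join(f"{b:>6}" for b in names))
for a in names:
    print(f"{a:<4}" + " ".join(f"{corr[a, b]:6.2f}" for b in names))
```

With these numbers, V1 correlates strongly with V3 and V4 (above 0.9) and V2 correlates with V6 (about 0.7), while V1 and V2 are almost uncorrelated, which matches the grouping suggested by eye.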
Exploratory Factor Analysis - Example
• For any l in {1, ..., m}, the error terms satisfy $E(e_l) = 0$ and
$\mathrm{Var}(e_l) = \sigma_l^2$.
Exploratory Factor Analysis. Uniqueness and communality
• Uniqueness + Communality = 1.
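This identity follows in one line from the usual assumptions of the factor model. They are stated here as assumptions because the slides do not spell them out: each observed variable $X_l$ is standardized, the common factors $F_1, \dots, F_q$ are uncorrelated with unit variance, and the error $e_l$ is uncorrelated with the factors. The loadings $a_{lj}$ and the number of factors $q$ are notation introduced for this sketch.

```latex
X_l = \sum_{j=1}^{q} a_{lj} F_j + e_l
\quad\Longrightarrow\quad
\operatorname{Var}(X_l)
  = \underbrace{\sum_{j=1}^{q} a_{lj}^2}_{\text{communality}}
  + \underbrace{\sigma_l^2}_{\text{uniqueness}}
  = 1
```

The communality is the share of the variance of $X_l$ explained by the common factors, and the uniqueness is the share left to the error term.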
Exploratory Factor Analysis - Example
Grades of 5 MBA students
Student Gfinance Gmarketing Gpolicy
1 3 6 5
2 7 3 3
3 10 9 8
4 3 9 7
5 10 6 5
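Before fitting any factor model, it can help to look at the correlations between the three grades. The sketch below computes them from the table above with a hand-written Pearson correlation (standard library only).

```python
from math import sqrt

# Grades of the 5 MBA students, copied from the table.
grades = {
    "Gfinance": [3, 7, 10, 3, 10],
    "Gmarketing": [6, 3, 9, 9, 6],
    "Gpolicy": [5, 3, 8, 7, 5],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


pairs = [
    ("Gfinance", "Gmarketing"),
    ("Gfinance", "Gpolicy"),
    ("Gmarketing", "Gpolicy"),
]
corr = {(a, b): pearson(grades[a], grades[b]) for a, b in pairs}

for (a, b), r in corr.items():
    print(f"{a:>10} vs {b:<10}: {r:+.3f}")
```

Marketing and policy grades move together (correlation of about 0.98), while finance is nearly uncorrelated with both. This is the kind of pattern a factor analysis summarizes.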
Exploratory Factor Analysis - Example
• We choose how many factors to retain by looking at the eigenvalues, with the
same procedure as in PCA. Here we can retain two factors.
• Here
• Gfinance = 0.0299 F1 + 0.995 F2 + e1
• Gmarketing = 0.9941 F1 - 0.0815 F2 + e2
• Gpolicy = 0.9961 F1 + 0.0514 F2 + e3
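The loadings above can be turned into communalities and uniquenesses directly. The sketch below assumes standardized grades and two uncorrelated, unit-variance factors (assumptions of this example, not stated on the slide), so that the communality of a variable is the sum of its squared loadings and the uniqueness is one minus that sum. Under the same assumptions, the correlation between two variables implied by the common factors is the sum of the products of their loadings.

```python
# Loadings (F1, F2) copied from the three equations above.
loadings = {
    "Gfinance": (0.0299, 0.995),
    "Gmarketing": (0.9941, -0.0815),
    "Gpolicy": (0.9961, 0.0514),
}

# Communality: variance explained by the common factors
# (valid for uncorrelated, unit-variance factors).
communality = {v: sum(a * a for a in row) for v, row in loadings.items()}
# Uniqueness: variance left to the error term.
uniqueness = {v: 1.0 - h for v, h in communality.items()}

# Correlation between marketing and policy implied by the two factors.
implied_mk_pol = sum(
    a * b for a, b in zip(loadings["Gmarketing"], loadings["Gpolicy"])
)

for v in loadings:
    print(f"{v:<10} communality={communality[v]:.4f} "
          f"uniqueness={uniqueness[v]:.4f}")
print(f"implied corr(Gmarketing, Gpolicy) = {implied_mk_pol:.3f}")
```

All three communalities are above 0.99, so two factors reproduce almost all of the variance of each grade. Marketing and policy load on F1, and finance loads on F2.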
• Euclidean distance: $D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$,
where $x_{ki}$ is the value of variable k for observation i.
Potential Problems
• Different measures (scales) = different weights
• Correlation between variables (double counting)
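The first problem is easy to see numerically. The sketch below computes Euclidean distances on four hypothetical observations (income in euros, age in years; values invented for illustration), first on the raw data and then after standardizing each variable to mean 0 and standard deviation 1. Standardizing is one common remedy, shown here as an option and not as the method prescribed by the slides.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale each column to mean 0 and (population) std deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)]
            for row in rows]


# Hypothetical observations: (income in euros, age in years).
rows = [[30000, 25], [30500, 60], [33000, 26], [60000, 45]]

raw_01 = euclidean(rows[0], rows[1])  # similar income, very different age
raw_02 = euclidean(rows[0], rows[2])  # similar age, income differs by 3000

z = standardize(rows)
std_01 = euclidean(z[0], z[1])
std_02 = euclidean(z[0], z[2])

print(f"raw:          d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"standardized: d(0,1)={std_01:.3f}  d(0,2)={std_02:.3f}")
```

On the raw data, income dominates the distance simply because its values are larger, so observation 0 looks closest to observation 1 despite a 35-year age gap. After standardization, both variables carry comparable weight and observation 0 is closest to observation 2.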
• Hierarchical procedures
• Agglomerative (start from n clusters to get to 1 cluster)
• Divisive (start from 1 cluster to get to n clusters)
• Divisive:
• All observations are initially assumed to belong to a single cluster.
• The most dissimilar observation is extracted to form a separate cluster.
• After the first step there are two clusters, after the second step three clusters, and so
on, until the final step produces as many clusters as there are observations.
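The agglomerative direction (from n clusters down to 1) can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible linkage rules, and five hypothetical 2-D points.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the list of partitions, from n singletons to one cluster.
    Clusters are lists of observation indices.
    """
    clusters = [[i] for i in range(len(points))]
    history = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is still valid
        history.append([c[:] for c in clusters])
    return history


# Hypothetical observations, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
history = agglomerate(points)
for stage in history:
    print(len(stage), "clusters:", stage)
```

Each pass removes exactly one cluster, so five observations give partitions of size 5, 4, 3, 2 and 1. A divisive procedure would walk through the same sizes in the opposite order.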
Cluster Analysis. Non-Hierarchical Procedure: K-means
• Knowledge of the number of clusters (k) is required.
• First, initial cluster centres (the seeds) are determined for each of the k
clusters (usually random choices, shown as triangles).
• Each iteration allocates observations to each of the k clusters, based on
their distance from the cluster centres.
• Cluster centres are then computed again, and observations may be reallocated
to the nearest cluster in the next iteration (new triangle).
• When no observations can be reallocated, the process stops.
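The steps above can be sketched as follows. This is a minimal illustration in plain Python, not the algorithm as implemented by any particular package: seeds are k observations drawn at random, distances are Euclidean, and the data are six hypothetical 2-D points forming two well-separated groups.

```python
import random
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centres) for a basic k-means run."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # seeds: k distinct observations
    labels = None
    for _ in range(max_iter):
        # Allocate each observation to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing was reallocated: stop
            break
        labels = new_labels
        # Recompute each centre as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centres


# Hypothetical observations: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centres = kmeans(points, k=2)
print("labels: ", labels)
print("centres:", centres)
```

Which group is numbered 0 and which is numbered 1 depends on the random seeds, so only the grouping itself is meaningful, not the label values.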
Example « Datacrime » Database
• Stata code for k = 2:
cluster kmeans murder rape robbe assau burgla larcen auto, k(2) measure(L2) start(krandom)
• Stata creates a new variable, _clus_XX, which takes the values 1 and 2 because
k = 2.
• Let us do the same for k = 3, k = 4 and k = 5. Each run creates another new
variable.
How to choose k?
• The choice can be based on the researcher’s knowledge or the marketer’s
decision or, more statistically, on the “Pseudo F”.
k = 2, F = 61.68
k = 3, F = 65.27 <- The best
k = 4, F = 64.29 <- Good too
k = 5, F = 53.04
For larger k, F keeps decreasing.
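For reference, here is a sketch of how such a statistic can be computed. It assumes that the "Pseudo F" is the Calinski-Harabasz index, that is, the between-cluster sum of squares divided by (k - 1) over the within-cluster sum of squares divided by (n - k). The slides do not give the formula, so treat this as an assumption. The toy points are hypothetical; the dictionary of F values is copied from the list above.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo-F for a given partition (assumed formula)."""
    n = len(points)
    groups = sorted(set(labels))
    k = len(groups)
    dims = range(len(points[0]))
    grand = [sum(p[d] for p in points) / n for d in dims]
    between = within = 0.0
    for g in groups:
        members = [p for p, lab in zip(points, labels) if lab == g]
        centre = [sum(p[d] for p in members) / len(members) for d in dims]
        # Between: cluster size times squared distance to the grand mean.
        between += len(members) * sum(
            (centre[d] - grand[d]) ** 2 for d in dims
        )
        # Within: squared distances of members to their own centre.
        within += sum((p[d] - centre[d]) ** 2 for p in members for d in dims)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical one-dimensional observations with an obvious split.
toy = [(0.0,), (2.0,), (10.0,), (12.0,)]
f_good = pseudo_f(toy, [0, 0, 1, 1])  # natural grouping
f_bad = pseudo_f(toy, [0, 1, 0, 1])   # grouping that mixes the two ends

# F values reported above for the crime data.
reported = {2: 61.68, 3: 65.27, 4: 64.29, 5: 53.04}
best_k = max(reported, key=reported.get)

print(f"toy data: F(good)={f_good:.2f}  F(bad)={f_bad:.2f}")
print("k with the largest reported F:", best_k)
```

A larger F means clusters that are compact and well separated relative to the number of clusters used, which is why the partition with the highest value is preferred on statistical grounds.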
How to choose k?
• k = 3 is the best from a statistical point of view, but the researcher has
the last word!
• k = 4 is good too. (Easier interpretation?)
Interpretation
• To help interpret the clusters, one can run a PCA. Commands:
pca murder-auto
predict PC1 PC2, score
by _clus_2, sort : summarize PC1
by _clus_2, sort : summarize PC2
Group 1 contains states with low levels of crime, mostly of the violent type.
Group 2 contains states with high levels of crime, mostly property crime.
Group 3 contains states with intermediate levels of crime of all types.
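The by-cluster summaries used for this interpretation amount to averaging the component scores within each cluster. The sketch below shows the idea in Python on hypothetical (cluster, PC1, PC2) rows. The numbers are invented for illustration and are not the scores of the crime data, and the function name is made up for this example.

```python
from collections import defaultdict

# Hypothetical (cluster, PC1, PC2) rows, for illustration only.
rows = [
    (1, -1.8, -0.9), (1, -1.2, -0.5),
    (2, 2.1, 0.8), (2, 1.7, 1.2),
    (3, 0.1, -0.2), (3, -0.3, 0.4),
]


def means_by_cluster(rows):
    """Mean PC1 and PC2 score within each cluster."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum PC1, sum PC2
    for cluster, pc1, pc2 in rows:
        acc = sums[cluster]
        acc[0] += 1
        acc[1] += pc1
        acc[2] += pc2
    return {c: (acc[1] / acc[0], acc[2] / acc[0])
            for c, acc in sorted(sums.items())}


means = means_by_cluster(rows)
for cluster, (m1, m2) in means.items():
    print(f"cluster {cluster}: mean PC1 = {m1:+.2f}, mean PC2 = {m2:+.2f}")
```

Reading the sign and size of each cluster's mean score against the meaning of the components is what leads to a verbal description of the groups.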