Assignment 1 2020
Instructions:
Submit only one file in pdf format to the link on the Study Desk.
Assume that your report will be read by someone familiar with the data set but
with limited statistical knowledge. Fully explain plots and when stating statistics
or results explain what they mean statistically AND in context of the data.
Presentation should be neat, consistent, spell-checked and proofread. All
questions should be clearly labelled, and all answers should clearly and concisely
address the questions.
If you convert a Word document to PDF for submission, check that all symbols,
equations etc. have converted correctly, i.e., proofread your work.
All answers must be typed – do not include handwritten/scanned or stylus/tablet
written responses in your document.
If you do not use knitr to compile your submission, where asked to provide R
code, paste relevant code within the assignment document and italicise (or
otherwise highlight or distinguish from other content). Do not include code in an
appendix.
Do not include an appendix at all. Any work included in an appendix will not be
marked.
Please note that referencing textbooks and other resources is not the goal of this
assessment. This work requires students to demonstrate their understanding of
the analysis and interpretation, not to provide quotes from resources.
When interpreting output, you are expected to do so in context of the data and
the method (i.e. ensure you comment on aspects of the method that affect your
interpretation with respect to the variables and sample).
A maximum of 10 marks will be deducted from your total marks for poor
presentation.
Marks:
Question 1: 25
Question 2: 20
Question 3: 30
Question 4: 25
Page 1 of 5
Data File:
The same data set will be used for all four questions in this assignment.
The data file ‘europegroup.txt’ contains data for the percentage of employment by
country (n=30 countries). The first variable identifies the region of the country (Group)
and the next nine variables represent different employment sectors: AGR=agriculture,
forests and fishing; MIN=mining; MAN=manufacturing; PS=power and water supplies;
CON=construction; SER=services; FIN=finance; SPS=social and personal services;
TC=transport and communication. Although you may not find these data to be
multivariate normal (MVN) in Question 1, you should proceed with all analysis requested
in Questions 2 to 4 assuming MVN, and comment on this limitation where relevant.
Question 1 (25 marks):
e) One way to try to meet the MVN assumption could be to remove some of the
variables from the multivariate analysis (do not perform this analysis). Suggest three
additional ways that you might improve univariate and multivariate normality for
data sets in general. (3 marks total)
f) In part e) we suggested removing some variables to try to help the data approach
MVN. Suggest one other reason why reducing the number of variables used in
multivariate analysis may be important. (This question does not relate specifically to
this particular data set.) (2 marks total)
g) In part e) we suggested removing some variables to try to help the data approach
MVN. Check whether the data are MVN if only those variables that are univariate
normal (UVN) are used (1 mark). Is your result reasonable given your understanding
of the relationship between UVN and MVN (2 marks)? (3 marks total)
Question 2 (20 marks):
a) Produce a draftsman display for the employment variables. Use the function
scatterplotMatrix (from week 2) and check the help documentation
(?scatterplotMatrix) to help you produce a plot with observations grouped by regional
group using different colours and include the associated legend. Your plot should not
include smoothing, regression lines, or distribution curves in the diagonal panels of
the plot (1 mark). Interpret these plots, relating back to the original data where it
may add to the interpretation (2 marks). What are the y and x axes on plot [3,2] of
the draftsman plot (1 mark)? (4 marks total)
b) In the context of MANOVA, list the dependent and independent variables (1 mark)
and define the relationship that the MANOVA would test (1 mark). (2 marks total)
c) Using MANOVA in R, test for differences in ‘percentage of employment’ between the
four country regions. Include tests using all four test statistics covered in this course
(2 marks) and interpret output (3 marks). (5 marks total)
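As a rough sketch of the kind of call part c) refers to, the following base R code fits a one-way MANOVA and requests four test statistics. It uses the built-in iris data as a stand-in (Species in place of Group, the four flower measurements in place of the employment variables), and it assumes that the four statistics meant are the ones offered by summary.manova(): Pillai, Wilks, Hotelling-Lawley and Roy. It illustrates syntax only and is not output for this assignment.

```r
# Illustrative only: iris stands in for the assignment data.
# Species plays the role of Group; columns 1 to 4 play the role of the
# employment variables.
responses <- as.matrix(iris[, 1:4])
fit <- manova(responses ~ Species, data = iris)

# The four test statistics available in summary.manova()
tests <- c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
results <- lapply(tests, function(tst) summary(fit, test = tst))
names(results) <- tests

# Print each table: statistic, approximate F, degrees of freedom, p-value
for (tst in tests) {
  cat("\n", tst, "\n")
  print(results[[tst]]$stats)
}
```

Each element of results holds a table with one row for the grouping term and one for the residuals, so the four statistics can be compared side by side.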
d) Explain how Wilks' Lambda statistic is calculated and why a small statistic is likely to
indicate significant differences between at least some groups (2 marks). Which of the
four tests used in part c) would be the best to interpret if there are concerns about
multivariate normality or covariance equality (1 mark)? (3 marks total)
e) Produce output that specifically compares each of the regions (Group) with each
other (you should have 6 comparisons) using Hotelling’s T² test and a significance
level of 0.05 (2 marks). Determine the multiple test corrected significance level (1
mark). Do not provide R output; instead reproduce and complete the following table
for all comparisons and interpret. What were the sample sizes for each region and
how may sample sizes have affected these results and those in part c) (2 marks)?
Will deviation from MVN influence these results (1 mark)? (6 marks total)
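One possible way to organise the pairwise comparisons in part e) is sketched below in base R. The helper hotelling_t2() is a hypothetical function written for this illustration (the course may instead use a package function), it assumes a pooled covariance matrix, and it is run on the built-in iris data, which has three groups and therefore three pairs rather than six.

```r
# Two-sample Hotelling's T^2 with a pooled covariance matrix (base R only).
# hotelling_t2() is a hypothetical helper, not a function from any package.
hotelling_t2 <- function(x1, x2) {
  x1 <- as.matrix(x1); x2 <- as.matrix(x2)
  n1 <- nrow(x1); n2 <- nrow(x2); p <- ncol(x1)
  d  <- colMeans(x1) - colMeans(x2)                 # difference in mean vectors
  sp <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)
  t2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(sp, d))
  f  <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2  # exact F transformation
  c(T2 = t2, F = f, df1 = p, df2 = n1 + n2 - p - 1,
    p.value = pf(f, p, n1 + n2 - p - 1, lower.tail = FALSE))
}

groups    <- levels(iris$Species)
pairs_idx <- combn(groups, 2)          # every unordered pair of groups
pairwise  <- t(apply(pairs_idx, 2, function(g) {
  hotelling_t2(iris[iris$Species == g[1], 1:4],
               iris[iris$Species == g[2], 1:4])
}))
rownames(pairwise) <- apply(pairs_idx, 2, paste, collapse = " vs ")
print(pairwise)
```

Note that solve() stops with an error when the pooled covariance matrix is singular, which can happen when two groups together contain few observations relative to the number of variables.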
Question 3 (30 marks):
a) Produce (2 marks) and interpret (2 marks) the correlation and covariance matrices
(2 marks). Explain the difference between these matrices in detail (i.e. explain
clearly how the values are adjusted mathematically and the effect of these changes)
(2 marks). Would using the covariance matrix in PCA on this data be appropriate (1
mark)? Why (1 mark)? (8 marks total)
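The relationship between the two matrices in part a) can be checked numerically. The sketch below uses four columns of the built-in mtcars data as a stand-in for the employment variables (an assumption made purely for illustration) and confirms that the correlation matrix is the covariance matrix rescaled by the standard deviations.

```r
# Illustrative only: four mtcars columns stand in for the employment variables.
X <- mtcars[, c("mpg", "disp", "hp", "wt")]

S <- cov(X)   # covariance matrix: depends on the scale of each variable
R <- cor(X)   # correlation matrix: unit-free, diagonal of 1

# Rescale S by the standard deviations: R = D^(-1) S D^(-1),
# where D is the diagonal matrix of standard deviations.
D_inv    <- diag(1 / sqrt(diag(S)))
R_manual <- D_inv %*% S %*% D_inv
dimnames(R_manual) <- dimnames(S)

print(round(S, 2))
print(round(R, 2))
print(all.equal(R, R_manual))        # same matrix up to rounding error
print(all.equal(R, cov(scale(X))))   # covariance of standardised data
```

The last line shows an equivalent view: the correlation matrix is the covariance matrix of the variables after each has been centred and divided by its standard deviation.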
b) Perform PCA on the 4 employment variables using the prcomp function.
Provide the eigenvalues (1 mark), percentage of variation (1 mark) and scree plot
(1 mark). Interpret each of these results (3 marks) and discuss how they influence
your decision on how many PCs to interpret from this analysis (2 marks). Remember to
keep in mind the overall purpose of PCA. (8 marks total)
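A minimal sketch of how the quantities in part b) can be extracted from a prcomp object follows. It again uses four mtcars columns as stand-in data, and the choice scale. = TRUE (a correlation-based PCA) is an assumption made for the example; whether scaling is appropriate for the employment data is the decision asked about in part a).

```r
# Illustrative only: four mtcars columns stand in for the employment variables.
X  <- mtcars[, c("mpg", "disp", "hp", "wt")]
pc <- prcomp(X, center = TRUE, scale. = TRUE)  # scale. = TRUE is an assumption

eigenvalues <- pc$sdev^2                          # variance of each PC
pct_var     <- 100 * eigenvalues / sum(eigenvalues)

pca_table <- data.frame(
  PC         = seq_along(eigenvalues),
  eigenvalue = round(eigenvalues, 3),
  percent    = round(pct_var, 2),
  cumulative = round(cumsum(pct_var), 2)
)
print(pca_table)

# Loadings (eigenvectors): one column per PC, one row per variable
print(round(pc$rotation, 3))
```

From the same object, screeplot(pc, type = "lines") draws a scree plot and biplot(pc) draws a biplot of the first two PCs.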
c) Interpret (2 marks) the first PC. Include the Z equation (1 mark) and a plot of the
loadings on the first PC in your answer (1 mark). (4 marks total)
d) What is the correlation between the first and second PCs and what does this tell you?
(2 marks total)
e) Produce (1 mark) and interpret (2 marks) a biplot based on the first 2 PCs. In
particular, explain your interpretation of the employment variables in country 19
compared to country 9 (1 mark). Relate your interpretation back to the original data
(1 mark). (5 marks total)
f) Was this a useful analysis for this data set? Explain with specific reference to the
results of your prior analysis in this question. (3 marks total)
Question 4 (25 marks):
For all of Question 4, do not use all nine employment variables; use only the 4
variables identified in Question 1 as UVN.
Provide R code, output and written interpretation for parts a) to e) of this question.
a) Perform a Factor Analysis using the factanal function. Initially use the number of
components you identified as informative in Question 3 (do not use parallel analysis
to help inform your decision here) and apply no rotation. You will get an error
message. To resolve this issue and make further decisions about your
analysis you will need to have read the additional notes available in the Week 6 block
on the Study Desk called “notes on df limiting number of factors.pdf”. Provide your
initial line of code, subsequent error message and your final line of code that
successfully performs the factanal analysis (2 marks). What did you need to change
and why (2 marks)? (4 marks total)
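When a call to factanal stops with an error, tryCatch() offers one way to capture the exact message so it can be pasted into a report. The sketch below is run on simulated data with four variables and a single hypothetical common factor; the seed, the loadings and the helper try_factanal() are all assumptions made for illustration and have nothing to do with the employment data.

```r
# Illustrative only: simulate four variables driven by one common factor.
set.seed(1)                               # arbitrary seed for the stand-in data
n <- 200
common <- rnorm(n)
true_loadings <- c(0.8, 0.7, 0.6, 0.5)    # hypothetical values
X <- sapply(true_loadings,
            function(l) l * common + rnorm(n, sd = sqrt(1 - l^2)))
colnames(X) <- paste0("V", 1:4)

# Hypothetical helper: return the fitted model, or the error message as text
try_factanal <- function(k) {
  tryCatch(factanal(X, factors = k, rotation = "none"),
           error = function(e) conditionMessage(e))
}

attempt_two <- try_factanal(2)   # two factors requested from four variables
attempt_one <- try_factanal(1)   # one factor requested from four variables

print(attempt_two)
print(attempt_one)
```

A result of class "factanal" indicates that the fit succeeded, whereas a character string is the captured error message.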
b) From your successful factanal analysis in part a) provide output and interpretation for
(8 marks total):
• Variance explained (2 marks)
• Chi-square test (2 marks)
• Variable loadings (2 marks)
• Difference in uniqueness values for the variables FIN and SPS (2 marks)
c) How would your results change if you applied a rotation? Explain your reasoning. (4
marks total)
d) Perform parallel analysis using a seed value of 245 and 500 iterations. Produce the
scree plot for the PC results only (1 mark). Discuss how many PCs are recommended
by this analysis and use the plot to help you explain these results (2 marks). As part
of your explanation provide the values for the 95th percentile for components 1 and 2
(1 mark). (4 marks total)
e) Explain in your own words how the parallel analysis works. (5 marks total)
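For orientation, the general idea behind parallel analysis can be sketched by hand in base R. This is a simplified illustration under stated assumptions, not the function used in the course: it draws uncorrelated normal data of the same dimensions as the real data, uses four mtcars columns as stand-in data, and applies a plain "observed eigenvalue exceeds the 95th percentile" comparison. Values from a dedicated parallel analysis function will generally differ, even with the same seed.

```r
# Illustrative only: four mtcars columns stand in for the employment variables.
set.seed(245)
X <- mtcars[, c("mpg", "disp", "hp", "wt")]
n <- nrow(X)
p <- ncol(X)
iterations <- 500

# Eigenvalues of the observed correlation matrix
observed <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from random, uncorrelated data of the same size (n by p)
random_eigs <- t(replicate(iterations, {
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
}))

# 95th percentile of the random eigenvalues for each component
threshold <- apply(random_eigs, 2, quantile, probs = 0.95)

pa_table <- data.frame(
  component    = seq_len(p),
  observed     = round(observed, 3),
  percentile95 = round(threshold, 3),
  exceeds      = observed > threshold
)
print(pa_table)
```

Plotting observed and threshold against the component number on the same axes gives a scree plot with the random-data reference line overlaid.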