Assignment 2

Total Marks: 100; Weighting: 30%


Due: 21/5/20 11.55pm

Instructions:
• Submit only one file in PDF format to the link on the Study Desk.
• This Assignment includes material up to and including Cluster Analysis.
• Assume that your report will be read by someone familiar with the data sets but with
limited statistical knowledge. Fully explain plots and when stating statistics or results
explain what they mean statistically AND in context of the data.
• Unless otherwise stated you should include R code, output and interpretation for all
analysis questions.
• Presentation should be neat, consistent, spell-checked and proofread. All questions
should be clearly labelled, and all answers should clearly and concisely address the
questions.
• If you convert a Word document to PDF for submission check that all symbols,
equations etc. have converted correctly, i.e., proofread your work.
• If you do not use knitr to compile your submission, where asked to provide R code,
paste relevant code within the assignment document and italicise (or otherwise
highlight or distinguish from other content). Do not include code in an appendix.
• Do not include an appendix at all. Any work included in an appendix will not be
marked.
• Please note that referencing textbooks and other resources is not the goal of this
assessment. This work requires students to demonstrate their understanding of the
analysis and interpretation, not provide quotes from resources.
• When interpreting output, you are expected to do so in context of the data and the
method (i.e. ensure you comment on aspects of the method that affect your
interpretation with respect to the variables and sample).
• A maximum of 10 marks will be deducted from your total marks for poor
presentation.

Marks:
• Question 1: 30
• Question 2: 25
• Question 3: 15
• Question 4: 25
• Question 5: 5

Data file: Only one data set will be used for all questions in this assignment.
The data file ‘psy_grades.txt’ contains data measuring three psychological variables and
four academic variables (grades) for 300 students. The sex of each student and the school
they attend was also recorded.
Psychological variables:
• control: locus of control is the degree to which people believe that they have control
over the outcome of events in their lives.
• self: self-concept is a collection of beliefs about oneself that includes elements such
as academic performance.
• motive: motivation is a measure of achievement motivation.
Academic variables:
• english: grade of performance over 1 year.
• history: grade of performance over 1 year.
• maths: grade of performance over 1 year.
• biology: grade of performance over 1 year.
Other variables:
• sex: 0=Male and 1=Female
• school: school 1, 2 or 3
• the row labels are the student ID number
Assume all variables meet MVN and other test assumptions for the purpose of these
assessment questions.

Question 1 (30 marks):


Note: When you first import the data please name your dataframe ‘pg’.

(a) Based on standardised variables produce and comment on 3 separate pairwise
correlation matrices:
i. correlation between the 3 psychological variables (1 mark);
ii. correlation between the 4 academic variables (1 mark);
iii. correlation between the 3 psychological variables and the 4 academic variables
(1 mark).
Do these correlation matrices suggest that canonical correlation would be an appropriate
form of analysis and why (2 marks)? (5 marks total)

(b) Perform a canonical correlation on this data set for the standardised variables X1 to X3
(control, self, motive) and Y1 to Y4 (english, history, maths, biology). Provide code and
relevant output, definitions and interpretations for: (12 marks total)
i. canonical correlations (also explain why canonical correlations become
successively weaker but do not add up to one) (4 marks);
ii. chi-square test of significance and Rao’s F approximation significance test (4
marks);
iii. redundancy coefficients for the variance in the Y set of variables explained by the
variance in the X set (4 marks).
[Note: ‘relevant’ in this question requires you to select the appropriate parts of the
output from your analysis to address each dot-point – do not include all R output for this
analysis].
(c) Provide the relevant output and the equations that describe the first canonical function
using your analysis solution from part b) (1 mark). Provide the relevant output and
interpret the canonical loadings from part b) (2 marks), and discuss the value of the
analysis overall and any cautions in interpretation that should be noted (2 marks). (5
marks total)
(d) Provide the output (1 mark) from part b) showing the eigenvalues and interpret (1
mark). Explain the relationship between eigenvalues and canonical correlations (1
mark). (3 marks total)
(e) Why is canonical correlation an appropriate technique for this analysis and not multiple
regression or MANOVA? (2 marks total)
(f) Identify at least 3 limitations associated with canonical correlation analysis. (3 marks
total)
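
As a rough illustration of the workflow Question 1 describes (not a model answer), the sketch below standardises two variable sets, forms the three correlation blocks, and runs a canonical correlation with base R's `cancor`. The data frame `toy` and its simulated values are assumptions made purely so the code runs on its own; only the column names mirror the variable list above. Bartlett's chi-square approximation is computed by hand here, and other packages may report significance tests and eigenvalues on a different scale.

```r
# Simulated stand-in for the real data (assumption: NOT psy_grades.txt).
set.seed(1)
n <- 300
latent <- rnorm(n)  # shared factor so the two sets are correlated
toy <- data.frame(
  control = 0.5 * latent + rnorm(n),
  self    = 0.3 * latent + rnorm(n),
  motive  = 0.4 * latent + rnorm(n),
  english = 50 + 6 * latent + rnorm(n, sd = 8),
  history = 50 + 5 * latent + rnorm(n, sd = 8),
  maths   = 50 + 7 * latent + rnorm(n, sd = 8),
  biology = 50 + 4 * latent + rnorm(n, sd = 8)
)

psych <- c("control", "self", "motive")
acad  <- c("english", "history", "maths", "biology")

# Standardise each column to mean 0 and standard deviation 1.
X <- scale(toy[, psych])
Y <- scale(toy[, acad])

# Three correlation blocks: within X, within Y, and between X and Y.
Rxx <- cor(X)
Ryy <- cor(Y)
Rxy <- cor(X, Y)

# Canonical correlation (cancor is in the stats package shipped with R).
cc <- cancor(X, Y)
r  <- cc$cor   # min(3, 4) = 3 canonical correlations, largest first

# Squared canonical correlations are the eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx.
M  <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
ev <- sort(Re(eigen(M)$values), decreasing = TRUE)

# Bartlett's chi-square approximation for H0: all canonical correlations are 0.
p <- ncol(X); q <- ncol(Y)
chi2 <- -(n - 1 - (p + q + 1) / 2) * sum(log(1 - r^2))
pval <- pchisq(chi2, df = p * q, lower.tail = FALSE)
```

Printing `round(Rxy, 2)`, `r`, `ev` and `c(chi2 = chi2, p = pval)` shows each piece; with the real data the same calls would be applied to the `pg` dataframe.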

Question 2 (25 marks):


Determine if the school of a student can be predicted by their grades across the four
academic variables by completing the following:

(a) Produce (1 mark) and interpret (3 marks) pair-wise scatter plots using the ‘splom’
function (see Week 7) for all four of the academic variables, distinguishing between
schools using colour. (4 marks total)
(b) Create training and test sets using school as a factor (you will need to convert it to a
factor first), with a 70/30 split and a seed value of 1125. Use the table function in R to
provide the number of students in each school for both the training and the test set that
you have constructed. (2 marks total)
(c) How would increasing the training/test split to 80/20 potentially affect your results? (do not
perform this analysis) (2 marks total)

(d) Perform a DFA using the training set. Explain why there are only two DFs calculated (1
mark). Provide output, definition, and interpretation (in context of the data and method)
for each of the following: (10 marks total)
i. the prior probabilities (3 marks)
ii. the trace values (3 marks)
iii. the weightings/coefficients on LD1 and LD2 (3 marks)
(e) Based on the DFA, predict school membership for the test set and create and interpret a
table showing observed vs predicted for the test set (2 marks). Create an x-y plot of the
two DFs grouped by the original school labels and another by the predicted school labels
(2 marks). Indicate on the 2nd plot the school 3 students who were misclassified as
school 2 (1 mark). (5 marks total)
(f) Why would we expect the misclassification rate for the training set to be lower than for the
test set? (2 marks total)
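
The mechanics of the split, the DFA and the observed vs predicted table can be sketched as follows. This is one possible approach under stated assumptions, not a model answer: the data frame `toy` is simulated, the split uses a single `sample()` call (the course materials may use a different method), and `lda` comes from the MASS package that is installed with R. Because the data and the order of random draws differ from the real task, none of the numbers will match results from `psy_grades.txt`.

```r
library(MASS)  # lda(); MASS is a recommended package installed with R

# Simulated stand-in for the real data (assumption: NOT psy_grades.txt).
set.seed(1125)
n <- 300
school <- sample(1:3, n, replace = TRUE)
toy <- data.frame(
  english = rnorm(n, mean = 45 + 4 * school, sd = 8),
  history = rnorm(n, mean = 50 + 2 * school, sd = 8),
  maths   = rnorm(n, mean = 55 - 3 * school, sd = 8),
  biology = rnorm(n, mean = 50, sd = 8),
  school  = factor(school)   # the grouping variable must be a factor
)

# 70/30 split: draw 70% of the row indices for training, keep the rest for testing.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
table(train$school)
table(test$school)

# Linear discriminant analysis on the training set.
fit <- lda(school ~ english + history + maths + biology, data = train)
fit$prior                                   # prior probabilities (training proportions)
fit$scaling                                 # coefficients on LD1 and LD2
prop_trace <- fit$svd^2 / sum(fit$svd^2)    # proportion of trace per discriminant function

# Predict the held-out schools and cross-tabulate observed vs predicted.
pred <- predict(fit, newdata = test)
confusion <- table(observed = test$school, predicted = pred$class)
error_rate <- 1 - sum(diag(confusion)) / sum(confusion)
```

The discriminant scores for plotting are in `pred$x` (columns `LD1` and `LD2`), which can be coloured by either `test$school` or `pred$class`.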

For questions 3 and 4 you will use a subset of the psy_grades.txt data set. Run the
following code, ensure you use the specified seed value, and use the dataframe created,
‘pg_new’, for all analysis in Question 3 and Question 4.

> set.seed(24358)
> pg_new<- pg[sample(1:nrow(pg), 20, replace=FALSE),]
> table(pg_new$school)
1 2 3
7 7 6
> str(pg_new)
'data.frame': 20 obs. of 9 variables:
$ control: num -0.84 0.06 0.22 -0.38 0.71 0.75 -0.4 0.04 0.46 0.96 ...
$ self : num -0.57 0.03 0.03 0.34 0.03 1.19 0.03 0.03 0.03 0.63 ...
$ motive : num 0.33 1 0.33 0.67 1 1 1 0.67 0.67 1 ...
$ english: num 33.6 41.6 52.1 38.9 54.8 60.1 44.2 65.4 52.1 65.4 ...
$ history: num 33.3 54.1 54.7 28.1 61.2 61.9 54.1 51.5 56.7 64.5 ...
$ maths : num 41 41.2 49.5 35.3 53.7 67.1 59.3 61.2 53 70.3 ...
$ biology: num 36.3 41.7 53.6 39 48.8 49.8 58 68.8 47.1 66.1 ...
$ sex : int 0 1 1 0 0 1 1 0 1 0 ...
$ school : int 3 3 2 3 2 2 1 1 2 1 ...
> row.names(pg_new)
[1] "176" "296" "14" "59" "191" "55" "195" "87" "33" "116"
[11] "156" "271" "90" "44" "64" "192" "284" "76" "99" "133"

If your result from row.names is a different set of numbers, please paste these into the
start of your Q3 results.

Question 3 (15 marks):
Provide R code, output and written interpretation for all analyses.
All analysis in this question should be based on the 20 observations in your ‘pg_new’
dataframe.
(a) Create a distance matrix for students based on the standardised psychological
measures using Euclidean distance. Show only the lower triangle of distances and do not
include the diagonal zeros. Limit all values to 2 decimal places. Label the rows and
columns with the school number (1, 2 or 3) associated with each student. [Note: If you
are using Word, try reducing font size and try Verdana font to make the matrix fit on
the page] (3 marks total)
(b) Create the same distance matrix as in part a) but label the rows and columns by the
original student ID number from the ‘pg_new’ dataframe. (2 marks total)
(c) Based on the psychological measures which student is student 271 most dissimilar to (1
mark), and what school do they come from (1 mark)? (2 marks total) [Note: if you do
not have student 271 in your pg_new sample please use the 12th student in your list
from >row.names(pg_new)].
(d) Repeat the analysis in parts a) (1 mark), b) (1 mark) and c) (2 marks) for the
standardised academic measures. (4 marks total)
(e) Perform a Mantel’s test between the distance matrices for the academic and
psychological measures (1 mark). State the purpose of the Mantel’s test (2 marks) and
interpret the results (1 mark). (4 marks total)
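
The distance-matrix steps in this question can be sketched in base R as below. Everything here is illustrative and rests on assumptions: `toy` is a simulated stand-in for `pg_new`, its row names are made-up student IDs, and `mantel_perm` is a hypothetical helper written for this sketch that performs a simple one-sided permutation Mantel test. Packages such as vegan and ade4 provide ready-made Mantel tests, and their output may be laid out differently.

```r
# Simulated stand-in for pg_new (assumption: NOT the real sample).
set.seed(42)
n <- 20
toy <- data.frame(
  control = rnorm(n), self = rnorm(n), motive = rnorm(n),
  english = rnorm(n, 50, 10), history = rnorm(n, 50, 10),
  maths   = rnorm(n, 50, 10), biology = rnorm(n, 50, 10),
  school  = sample(1:3, n, replace = TRUE),
  row.names = as.character(sample(1:300, n))   # stand-in student IDs
)

# Euclidean distances on standardised variables; a dist object prints as the
# lower triangle without the diagonal by default.
psych <- scale(toy[, c("control", "self", "motive")])
acad  <- scale(toy[, c("english", "history", "maths", "biology")])
d_psych <- dist(psych, method = "euclidean")
d_acad  <- dist(acad,  method = "euclidean")
round(d_psych, 2)                    # labelled by student ID (the row names)

# Relabel by school: copy the object and swap its "Labels" attribute.
d_school <- d_psych
attr(d_school, "Labels") <- as.character(toy$school)
round(d_school, 2)

# Most dissimilar student to a given ID: largest entry in that student's row.
m <- as.matrix(d_psych)
id <- rownames(m)[12]                # here, the 12th student in the sample
farthest <- names(which.max(m[id, ]))
toy[farthest, "school"]

# Permutation Mantel test: correlate the two sets of distances, then recompute
# after shuffling the rows and columns of one matrix together.
mantel_perm <- function(d1, d2, nperm = 999) {
  m1 <- as.matrix(d1); m2 <- as.matrix(d2)
  lower <- lower.tri(m1)
  obs <- cor(m1[lower], m2[lower])
  perm <- replicate(nperm, {
    idx <- sample(nrow(m2))
    cor(m1[lower], m2[idx, idx][lower])
  })
  list(statistic = obs, p.value = (sum(perm >= obs) + 1) / (nperm + 1))
}
res <- mantel_perm(d_psych, d_acad)
```

`res$statistic` is the correlation between the two sets of distances and `res$p.value` is the proportion of permutations with a correlation at least as large.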

Question 4 (25 marks):


All analysis in this question should be based on the 20 observations in your ‘pg_new’
dataframe.

(a) Based on the standardised psychological and academic variables perform a cluster
analysis using Euclidean distances and Nearest-Neighbour linkage. Plot a dendrogram
based on this cluster analysis and label the tips of the dendrogram branches by school
(3 marks). Can you identify a good place to cut this tree? Explain your reasoning (3
marks). (6 marks total)
(b) Four alternative distance measures are listed below. Choose the one you think most
appropriate for this data and explain why (2 marks). Provide an explanation for each of
the other three (2 marks each) for why they are not appropriate (or are less
appropriate) for this data. (8 marks total)
• Minkowski
• Binary
• Manhattan
• Maximum (or minimax)
(c) Using your chosen distance from part b) repeat the cluster analysis and provide the
dendrogram (1 mark). Has this improved the usefulness of the dendrogram (1 mark)?
Explain (3 marks). (5 marks total)

(d) Repeat the analysis in part a) for the psychological measures only. Repeat again for the
academic measures only. Provide only the dendrograms in your solution (not code or
distance matrices) (2 marks). Comment on their usefulness, including clustering of
schools and branch lengths (4 marks). (6 marks total)
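
A minimal sketch of the clustering calls used in this question, assuming simulated data in a stand-in data frame `toy` (not the real `pg_new`). Nearest-Neighbour linkage corresponds to `method = "single"` in `hclust`. The Manhattan distance in the second call is an arbitrary choice to show where the distance measure is swapped, not a recommendation for part b).

```r
# Simulated stand-in for pg_new (assumption: NOT the real sample).
set.seed(7)
n <- 20
toy <- data.frame(matrix(rnorm(n * 7), nrow = n))
names(toy) <- c("control", "self", "motive",
                "english", "history", "maths", "biology")
toy$school <- sample(1:3, n, replace = TRUE)

z <- scale(toy[, 1:7])   # standardise all seven measures

# Nearest-neighbour linkage is method = "single" in hclust().
hc_euc <- hclust(dist(z, method = "euclidean"), method = "single")
plot(hc_euc, labels = as.character(toy$school),
     main = "Single linkage, Euclidean", xlab = "School", sub = "")

# Swapping the distance measure only changes the dist() call.
hc_alt <- hclust(dist(z, method = "manhattan"), method = "single")
plot(hc_alt, labels = as.character(toy$school),
     main = "Single linkage, Manhattan", xlab = "School", sub = "")

# Cutting the tree into k groups and comparing with school membership.
groups <- cutree(hc_euc, k = 3)
table(cluster = groups, school = toy$school)
```

Restricting `z` to the three psychological columns or the four academic columns gives the separate dendrograms described in part d).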

Question 5 (5 marks)
Write 100 to 300 words explaining, in context, whether any of these forms of analysis have
helped your understanding of the data. Be specific about what the different forms of
analysis have shown you, but do not restate results.
