Final Exam 2017


Venue _________________________________________

STUDENT NUMBER U _______________

Research School of Finance, Actuarial Studies and Statistics


EXAMINATION
Semester 1 – Final, 2017

STAT2008 Regression Modelling


Examination/Writing Time Duration: 180 minutes
Reading Time: 15 minutes
Exam Conditions:
Central Examination. This examination paper is not available to the ANU Library archives.
Students must return the examination paper at the end of the examination.
Materials permitted in the exam venue: (No electronic aids are permitted e.g. laptops, phones)
Unannotated paper-based dictionary (no approval required),
One A4 page with notes on both sides, Calculator
Materials to be supplied to Students:
Scribble Paper
Instructions to Students:
1. This examination paper comprises a total of twenty (20) pages and there is a separate handout of
R output which also has a total of twenty (20) pages. During the reading time preceding the exam,
please check that both documents have the correct number of pages.
2. All answers are to be written on this exam paper, which is to be handed in at the end of the exam.
You may make notes on scribble paper (or on the R handout) during the reading time, but
do NOT write on this exam paper until after the start of the writing time. If you need additional
space, use the rear of the previous page and clearly indicate the part of the question that your
answer refers to. The R handout and any scribble paper will be collected at the end of the
examination and destroyed; they will not be marked.
3. There are a total of four questions, which are worth 15 marks each, for a total of 60 marks.
The parts of each question are of unequal value, with the marks indicated for each part.
You should attempt to answer each and every part of all four questions. This examination
counts towards 60% of your final assessment.
4. Please write your student number in the space provided at the top of this page.
5. Include a clear statement of the formulae you use to answer each question.
6. Statistical tables (generated using R) are provided on pages 19 and 20 at the end of the handout of
R output. Unless otherwise indicated, use a significance level of 5% and note that log x refers to the
natural logarithm of x.

          Q1        Q2         Q3          Q4          Total
Pages     2 to 6    7 to 11    12 to 15    16 to 20
Marks     15        15         15          15          60
Score

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling


Page 1 of 20
Question 1 (15 marks)
The faraway library includes a data frame called cheddar, which contains data from a study of
cheddar cheese from the Latrobe Valley in Victoria. The concentration of Lactic acid, along
with the concentrations (on a log scale) of both Acetic acid and H2S (hydrogen sulphide) were
measured from 30 samples of cheese, which were then subjected to taste tests. Overall taste
scores were obtained by combining the scores from several tasters.
(a) A multiple regression model (cheddar.lm) has been fitted to these data and the summary
output from this model is given at the top of page 2 of the R output, but the analysis of
variance (ANOVA) table is not shown. Fill in the details of the ANOVA table in the
spaces shown below:
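             Df      Sum Sq      Mean Sq      F value      Pr(>F)

H2S          ____    ________    ________     ________     ________

Lactic       ____    ________    ________     ________     ________

Residuals    ____    ________    ________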

[Hint: rounding errors will accumulate as you derive entries in this table from other
values shown in the R output, so do NOT round the results of intermediate
calculations. DO round all your final answers in the above table to 2 decimal places.
You may also have to use the statistical tables to estimate one or more of the
p-values, or you can receive the marks for showing appropriate critical values.]

Working

(3 marks – 1 for each row of the ANOVA table)
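
The entries of a sequential (Type I) ANOVA table are linked by a few standard identities, sketched below. The sketch assumes that cheddar.lm regresses the taste score on H2S and then Lactic, with an intercept, so that n = 30, p = 3 and the residual degrees of freedom are n − p = 27. The symbols (RSS for the residual sum of squares, σ̂ for the residual standard error, t for a t value from the summary output) are notation introduced here, not taken from the handout.

```latex
\begin{aligned}
\text{Mean Sq}_j &= \frac{\text{Sum Sq}_j}{\text{Df}_j}, \qquad
\text{Mean Sq}_{\text{Residuals}} = \hat\sigma^2 = \frac{\text{RSS}}{n - p},\\[4pt]
F_j &= \frac{\text{Mean Sq}_j}{\text{Mean Sq}_{\text{Residuals}}}, \qquad
F_{\text{Lactic}} = t_{\text{Lactic}}^2 \quad \text{(last term entered)},\\[4pt]
F_{\text{overall}} &= \frac{\left(\text{Sum Sq}_{\text{H2S}} + \text{Sum Sq}_{\text{Lactic}}\right)/2}
                           {\text{Mean Sq}_{\text{Residuals}}}.
\end{aligned}
```

The last identity lets the two model sums of squares be recovered jointly from the overall F statistic, and the t value for the final term then separates them.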

Question 1 continued
(b) Residual plots for the model in part (a) are shown on pages 2 and 3 of the R output.
Do these plots suggest any problems with the underlying assumptions?
Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 2?
If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 3?
If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 3?
If so describe the problem(s):

What is your overall assessment? (select just ONE of the following options)
□ Residuals are not independent (obvious pattern)
□ Residuals do not have constant variance (heteroscedasticity)
□ Residuals are not normally distributed
□ There are possible outliers and/or influential observations
□ More than one of the above problems
□ No obvious problems
(2 marks – 0.5 for each section)

Question 1 continued
(c) For each of the following five diagnostic measures shown on page 4 of the R output,
calculate the relevant cut-off value suggested in the lecture notes and discuss whether or
not this cut-off is appropriate in this instance. Which observations, if any, exceed each
of the cut-off values?
The leverage or hat values (hii)

The externally studentised residuals (ti)

DFFITS

(see the next page for more answer spaces for part (c) of Question 1)

Question 1, part (c) continued

COVRATIO

DFBETAS

Given your answers above and considering the residual plots in part (b), are there
any observations that are vertical outliers and/or highly influential observations?
Should some observations be removed and the model re-fit to the remaining data?

(7 marks – 1 for each of the first 5 sections and 2 for the last summary section)
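
The lecture notes are not reproduced in this paper, so the cut-offs below are only the commonly quoted rules of thumb and may differ from those taught in the course. Here n is the number of observations and p is the number of estimated coefficients including the intercept (n = 30 and p = 3 under the assumption that the model contains H2S and Lactic only).

```latex
\begin{aligned}
h_{ii} &> \frac{2p}{n}, &
|t_i| &> t_{n-p-1}(0.975) \approx 2,\\[4pt]
|\mathrm{DFFITS}_i| &> 2\sqrt{\frac{p}{n}}, &
|\mathrm{COVRATIO}_i - 1| &> \frac{3p}{n},\\[4pt]
|\mathrm{DFBETAS}_{ij}| &> \frac{2}{\sqrt{n}}.
\end{aligned}
```

These thresholds are size-adjusted, so they tend to flag more observations in small samples than the absolute cut-offs that are sometimes used instead.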

Question 1 continued
(d) Output for a second model (cheddar.lm2) is shown on page 5 of the R output, which
includes an additional term added to the initial model described in the earlier parts of
this question. Is the term involving Acetic a significant addition to a model which
already includes H2S and Lactic? Give full details of an appropriate hypothesis test.

(3 marks)

Question 2 (15 marks)
The US Centers for Disease Control and Prevention (CDC) use data from the National Health
and Nutrition Examination Survey (NHANES) to develop a series of clinical growth charts
for assessing healthy growth ranges in boys and girls. The data frame kid.weights in the UsingR
library contains a sample of 250 observations taken from the NHANES data. The data frame
contains the age (in months), weight (in pounds) and height (in inches) for 129 girls (gender =
F) and 121 boys (gender = M), with age ranging from 3 months to 144 months (12 years).
(a) Page 6 of the R output shows code used to fit a series of models to these data. Residual
plots are given on page 7 for growth.lm3, the last of this series of models. Do these plots
suggest any problems with the underlying assumptions for model growth.lm3?
Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 7?
If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 7?
If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 7?
If so describe the problem(s):

What is your overall assessment? (select just ONE of the following options)
□ Residuals are not independent (obvious pattern)
□ Residuals do not have constant variance (heteroscedasticity)
□ Residuals are not normally distributed
□ There are possible outliers and/or influential observations
□ More than one of the above problems
□ No obvious problems
(2 marks – 0.5 for each section)

Question 2 continued
(b) On page 8 of the R output, there is also some summary output for the model growth.lm3,
including a few residual diagnostics. Use this summary output and your answers to part
(a) to comment on the following issues:
Observations 228, 9 and 158 were highlighted in some of the residual plots. Which
of the diagnostics on page 8 could you use to test if these observations are vertical
outliers? Are these observations really outliers or do they suggest some other
problem with the underlying assumptions?

Is growth.lm3 an appropriate model for the kid.weights data? If not, how might we
modify this model?

(2 marks)

Question 2 continued
(c) In the summary(growth.lm3) output on page 8 of the R handout, most of the summary
statistics and the partial regression coefficient for the interaction term boy:height have
been removed and replaced by question marks. Calculate all five missing statistics.
[Show all necessary formulae and working and round your final answers to no more
than 3 significant figures, as rounding errors will accumulate.]
Estimated coefficient for the boy:height term

The residual standard error and the corresponding degrees of freedom

Multiple R-squared

Adjusted R-squared

The F-statistic and the corresponding degrees of freedom

(5 marks)
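
The five missing statistics are tied to the rest of the summary output by the standard identities sketched below. The notation is an assumption made here: b_j, se(b_j) and t_j are the estimate, standard error and t value of a coefficient; RSS and SST are the residual and total sums of squares; n = 250; and p is the number of coefficients in growth.lm3 including the intercept, which must be read from the handout.

```latex
\begin{aligned}
b_j &= t_j \times \mathrm{se}(b_j), \qquad
\hat\sigma = \sqrt{\frac{\text{RSS}}{n - p}} \quad \text{on } n - p \text{ degrees of freedom},\\[4pt]
R^2 &= 1 - \frac{\text{RSS}}{\text{SST}}, \qquad
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p},\\[4pt]
F &= \frac{R^2/(p - 1)}{\left(1 - R^2\right)/(n - p)} \quad \text{on } (p - 1,\; n - p) \text{ degrees of freedom}.
\end{aligned}
```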

Question 2 continued
(d) The indicator variable boy is equal to 1 for each male observation and is 0 otherwise
(when the observation is a girl). This indicator variable was created at the end of page 6
of the R output and was included in the model growth.lm3.
The model growth.lm2 is also shown on page 6 of the R output, but has been turned
into a comment, so that the output for this model is not shown. What does the model
growth.lm2 suggest is the form of the relationship between weight and the
explanatory variables included in that model? What would have been the effects of
adding the indicator variable boy to the model growth.lm2 as just an additive term
(i.e. not including any interaction terms)?

Now examine the way in which the indicator variable boy has actually been added to
the model growth.lm2 to create the model growth.lm3. What are the effects of this
approach on the form of the relationship between the variables? Does the summary
output for the model growth.lm3 on page 8 of the R output suggest that the weight
growth curves for boys and girls differ by an additive constant; or a multiplicative
constant; or that completely separate curves are required?

(2 marks)

Question 2 continued
(e) In the summary(growth.lm3) output on page 8, some of the partial regression coefficients
do not have any “stars” next to their p-values. Does this mean the relevant terms should
be removed from the model? Discuss each of the terms that have no “stars” and explain
why that term should or should not be removed.

(3 marks)
(f) In the vif(growth.lm3) output on page 8, some of the variance inflation factors are
relatively large. Is this an issue that suggests some changes need to be made to the
model? Why or why not?

(1 mark)
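
For reference, the variance inflation factor of the j-th explanatory term has the standard definition below, where R_j² (notation assumed here) is the multiple R-squared from regressing that term on all the other terms in the model.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Terms built from one another, such as a variable and its interaction with an indicator, will usually have a large R_j² by construction.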

Question 3 (15 marks)
When you attach the UsingR library (from the recommended Verzani text) in R, a number of
other libraries are also attached, the first of which is the MASS library, where MASS is short for
the title of the 2002 book by Bill Venables and Brian Ripley, “Modern Applied Statistics with
S” (yet another text which has been recommended in this course in previous years).
The data frame cement in the MASS library contains information on the setting of thirteen
samples of Portland cement. For each sample, the percentages of the
four main chemical ingredients were accurately measured (x1 = tricalcium aluminate,
x2 = tricalcium silicate, x3 = tetracalcium alumino ferrite, and x4 = dicalcium silicate). While
the cement samples were setting, the amount of heat evolved was also measured (this is the
response variable, y, measured in calories/g).
(a) Pages 9 and 10 of the R output show a scatterplot matrix, a correlation matrix and other
output for the cement data. Comment on the relationships between the variables and
possible implications for fitting a multiple linear regression model with y as the
response variable and including all four of the possible explanatory variables, x1 to x4.

(2 marks)
(b) Pages 10 and 11 of the R output present output for a model, cement_all.lm, which
includes all four of the explanatory variables and for another model, cement_all.lm2,
which has the same four explanatory variables, but in a different order. The anova( )
tables are shown for both models, but the output from plot( ), summary( ) and vif( ) are
only shown for the first model. How would the plot( ), summary( ) and vif( ) output differ
for the second model (as opposed to the output shown for the first model)?

(1 mark)

Question 3 continued
(c) Residual plots for the model (cement_all.lm) are shown on page 11 of the R output. Do
these plots suggest any problems with the underlying assumptions?
Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 11?
If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 11?
If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 11?
If so describe the problem(s):

What is your overall assessment? (select just ONE of the following options)
□ Residuals are not independent (obvious pattern)
□ Residuals do not have constant variance (heteroscedasticity)
□ Residuals are not normally distributed
□ There are possible outliers and/or influential observations
□ More than one of the above problems
□ No obvious problems
(2 marks – 0.5 for each section)

Question 3 continued
(d) Compare the significance of the terms involving the explanatory variables in the
ANOVA table and summary output for model (cement_all.lm) and in the ANOVA table
for model (cement_all.lm2) presented on page 10 of the R output. Discuss the problem
suggested by these comparisons. Is there some other output that confirms this problem?

(2 marks)
(e) Present full details of a nested F test to test whether or not the variables x2 and x3 are a
significant addition to a model that already includes x4 and x1.

(3 marks)
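
A nested F test compares a reduced model with a full model that contains it. The general form is sketched below, with subscripts R and F (notation assumed here) denoting the reduced and full models and df the residual degrees of freedom. For the cement data the sample size is n = 13, the full model with x1 to x4 and an intercept has df_F = 8, and the reduced model with x4 and x1 has df_R = 10.

```latex
\begin{aligned}
H_0 &: \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_A: \text{at least one of } \beta_2, \beta_3 \neq 0,\\[4pt]
F &= \frac{\left(\text{RSS}_R - \text{RSS}_F\right)/\left(\text{df}_R - \text{df}_F\right)}
          {\text{RSS}_F/\text{df}_F}
  \;\sim\; F_{\,\text{df}_R - \text{df}_F,\;\text{df}_F} = F_{2,\,8} \quad \text{under } H_0 .
\end{aligned}
```

If a sequential ANOVA table enters x4 and x1 before x2 and x3, the numerator sum of squares is simply the sum of the Sum Sq entries for x2 and x3. Whether cement_all.lm2 uses that order has to be checked against the handout.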

Question 3 continued
(f) Output for a modified model (cement.lm) is presented on page 12 of the handout of R
output. Use this output to discuss whether or not the modifications appear to have
solved the problem with the earlier models identified in part (d). What other output
should you check to assess the fit of the model (cement.lm)?

(2 marks)
(g) Find 95% confidence intervals for each of the partial regression coefficients in the
model (cement.lm). Interpret the values of these partial regression coefficients.

(3 marks)
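
The usual t-based interval for a partial regression coefficient is sketched below, where b_j and se(b_j) are the estimate and standard error from the summary output, n = 13, and p (an assumption to be read from the handout) is the number of coefficients in cement.lm including the intercept.

```latex
b_j \;\pm\; t_{n-p}(0.975) \times \mathrm{se}(b_j)
```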

Question 4 (15 marks)
As discussed in Question 1 of Tutorial 4, the brains example from lectures was actually a
subset of data from a larger study, which was conducted to study the need for sleep in various
species of mammals. Data from the larger study are available in the data frame mammalsleep
in the faraway library, which includes the following variables: brain weight (g); body weight
(kg); gestation (days); lifespan (years); danger (a score which can be summarised as 1 if the
mammal is at a high level of danger from other animals when sleeping and 0 if the danger is
relatively low); and sleep (the total time spent sleeping per day in hours). In mammalsleep,
there are some missing values for sleep, lifespan and gestation, which leaves 51 species of
mammals for which we have typical values for all 6 variables.
(a) The process of extracting the data for modelling is shown on page 13 of the R output
and the final data for modelling are shown on pages 14 and 15. I have applied a natural
log (to the base e) transformation to all of the continuous variables. What is the purpose
of this log transformation and does it appear to be a sensible approach in this instance?

(1 mark)
(b) Page 16 of the R output shows the results of applying the step( ) function to suggest a
suitable multiple linear regression model for these data. Briefly describe the process of
model refinement that has been applied here.

(1 mark)
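
For a linear model, the criterion that step( ) reports at each stage is, up to an additive constant, the AIC sketched below. This assumes the default penalty k = 2; RSS is the residual sum of squares of the candidate model and p its number of estimated coefficients (notation assumed here).

```latex
\mathrm{AIC} = n \log\!\left(\frac{\text{RSS}}{n}\right) + 2p
```

At each step the candidate change giving the smallest AIC is adopted, and the search stops when no single change lowers the AIC further.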

Question 4 continued
(c) Residual plots for the model suggested by the step( ) function are shown on page 17 of
the R output. Do these plots suggest any problems with the underlying assumptions?
Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 17?
If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 17?
If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 17?
If so describe the problem(s):

What is your overall assessment? (select just ONE of the following options)
□ Residuals are not independent (obvious pattern)
□ Residuals do not have constant variance (heteroscedasticity)
□ Residuals are not normally distributed
□ There are possible outliers and/or influential observations
□ More than one of the above problems
□ No obvious problems
(2 marks – 0.5 for each section)

Question 4 continued
(d) Which mammals have been identified in each of the residual plots in part (c)? Find the
species name for the relevant observations in the listing of the data on page 15 of the R
output. Discuss any potential problems with these observations.

(3 marks)

Question 4 continued
(e) Which is the only explanatory variable that has not been included in the suggested
model (msleep.lm)? Looking back to the scatterplot and correlation matrices on page 14
of the R output, can you suggest a reason why this variable was excluded?

(1 mark)
(f) Page 18 of the R output shows some summary output for the model (msleep.lm). What
do the signs of each of the partial regression coefficients suggest about the expected
amount of time spent sleeping?

(3 marks)

Question 4 continued
(g) The correlation between log_sleep and log_lifespan was negative, so why does
log_lifespan have a positive partial regression coefficient in the suggested model?
Is log_lifespan a significant addition to a model that already includes log_gestation,
danger and log_body?

(2 marks)
(h) Under the suggested model, what is the expected difference in the daily hours spent
sleeping, between mammals that are in danger and those that are relatively safe? Find a
95% confidence interval for this difference.

(2 marks)
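
Because the response is on a log scale, the coefficient of danger acts multiplicatively on the hours of sleep. The sketch below assumes that the response in msleep.lm is log_sleep, that danger is the 0/1 indicator described in the question, and that the model contains log_gestation, danger, log_body and log_lifespan plus an intercept, so that n − p = 51 − 5 = 46. The symbols b_d and se(b_d) for the estimate and standard error of the danger coefficient are notation introduced here.

```latex
\begin{aligned}
\frac{\text{sleep}_{\text{danger}=1}}{\text{sleep}_{\text{danger}=0}} &= e^{\,b_d}
\quad \text{(other variables held fixed)},\\[4pt]
95\%\ \text{CI for the ratio} &:\;
\left( e^{\,b_d - t_{46}(0.975)\,\mathrm{se}(b_d)},\;\; e^{\,b_d + t_{46}(0.975)\,\mathrm{se}(b_d)} \right).
\end{aligned}
```

Strictly, the back-transformed ratio refers to typical (median) sleeping times rather than means, and a difference in hours can only be quoted relative to a stated baseline.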
END OF EXAMINATION
