Stat 252-Practice Final-Greg-Solutions

The document provides instructions for a practice final exam in STAT 252 at the University of Alberta. It outlines 12 instructions for students taking the exam, including that it is closed book, non-programmable calculators are permitted, phones should be turned off, there are 14 pages to the exam, and students have 3 hours to complete it. The exam is out of 64 total marks.

Uploaded by

deep81204

University of Alberta

Department of Mathematical and Statistical Sciences

STAT 252

PRACTICE FINAL

Instructor: Greg Wagner

SOLUTIONS

Student Name: ______________________________________

Signature: __________________________________

Instructions: (READ ALL INSTRUCTIONS CAREFULLY.)

1. This is a closed book exam.

2. You are permitted to use a NON-PROGRAMMABLE calculator, and the formula sheets and tables
provided.

3. Please turn off your cellular phones or pagers.

4. You have 3 hours to complete the exam.

5. The exam is out of a total of 64 marks.

6. This exam has 14 pages (including this cover and all computer output tables). Please ensure that
you have all pages.

7. Make sure your name and signature are on the front and your student ID number is at the top of
page two.

8. For questions that state you should show all steps, be sure that you do this in order to obtain full
credit. Conclusions must also be clearly stated. Your answers must have adequate justification.

9. For questions that state that you do not need to show all steps, read the question carefully and follow
the exact instructions regarding what is required.

10. If you run out of space in the blank area provided, use the back of the page to complete your answers
as needed and label such answers so that it is clear which question they belong to.

11. Also use the reverse sides of the pages for all rough work.

12. When referring to “log”, I am always referring to the natural log.

BEST WISHES!!

Student ID Number: ___________________

Question 1 (2 marks): What is an indicator or dummy variable? What is its application in regression?

An indicator variable is a categorical variable that has been turned into a quantitative variable so that it
can be included in a regression model. This is done by coding the levels of the categorical variable with a
0 and 1 (if there are only two levels) or with a combination of 0’s and 1’s (if there are more than two
levels). When applied in regression, dummy variables can be used to test whether categorical variables
(represented by the indicator variables) are useful in making predictions about the response variable.
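
The 0/1 coding described above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code`, the level names and the choice of baseline level are assumptions, not part of the exam. A variable with k levels is represented by k − 1 indicator columns, with the baseline level coded as all zeros.

```python
def dummy_code(values, baseline):
    """Code a categorical variable as 0/1 indicator columns.

    One column is produced per level other than the baseline,
    so a variable with k levels yields k - 1 indicators.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical observations of a three-level categorical variable.
methods = ["water", "alcohol", "soap", "alcohol"]
levels, rows = dummy_code(methods, baseline="water")

for method, row in zip(methods, rows):
    print(method, dict(zip(levels, row)))
```

The baseline observation ("water") gets zeros in every column, so each regression coefficient on an indicator compares that level with the baseline.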

Question 2 (Two parts totaling 5 marks): A randomized experiment was conducted on washing hands
using four different methods and determining subsequent bacterial counts. The output below is from an
ANOVA F-test which resulted in rejecting the null hypothesis and concluding that there is a difference in
the bacterial counts after using the four methods of washing hands.

SUMMARY
Groups                Count   Sum   Average   Variance
Just water            8       936   117       969.1429
Alcohol (65%)         8       300   37.5      705.4286
Anti-bacterial soap   8       740   92.5      1760.857
Regular soap          8       848   106       2205.143

ANOVA
Source of Variation   SS      df   MS         F       P-value
Between Groups        29882   3    9960.667   7.064   0.0011
Within Groups         39484   28   1410.143
Total                 69366   31

(a) (3 marks): Using the Bonferroni method at the 94% confidence level, determine the individual
comparison-wise error rate ($\alpha_I$), find the critical value (two-sided) from the appropriate statistical
table and calculate the margin of error. You do not need to perform all the steps. [Note: You only
have to calculate the margin of error once since the sample sizes are equal for all treatment groups.]

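The number of multiple comparisons (m) that are possible is:

$m = \frac{k(k-1)}{2} = \frac{4(4-1)}{2} = 6$

$\alpha_I = \frac{\alpha_F}{m} = \frac{0.06}{6} = 0.01$

The critical value of t at df = n – k = 32 – 4 = 28 for $\alpha_I/2 = 0.005$ is: $t_{28,\,0.005} = 2.763$

$ME_{ij} = t_{n-k,\,\alpha_I/2} \times \sqrt{MS_{ERROR}} \times \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$

$ME = 2.763 \times \sqrt{1410.143} \times \sqrt{\frac{1}{8} + \frac{1}{8}} = 2.763 \times 37.552 \times 0.5 = 51.878$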
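
The Bonferroni arithmetic above can be checked with a short Python sketch. The function name `bonferroni_margin` is hypothetical, and the critical value 2.763 is passed in as read from the t-table rather than computed.

```python
import math


def bonferroni_margin(alpha_family, k, mse, n_i, n_j, t_crit):
    """Return (m, individual error rate, margin of error) for one pair.

    t_crit is the two-sided critical value looked up in the t-table
    at df = n - k and tail area alpha_individual / 2.
    """
    m = k * (k - 1) // 2                      # number of pairwise comparisons
    alpha_individual = alpha_family / m       # comparison-wise error rate
    me = t_crit * math.sqrt(mse) * math.sqrt(1 / n_i + 1 / n_j)
    return m, alpha_individual, me


m, alpha_i, me = bonferroni_margin(
    alpha_family=0.06, k=4, mse=1410.143, n_i=8, n_j=8, t_crit=2.763
)
print(m, round(alpha_i, 4), round(me, 3))
```

Any two group means that differ by more than this margin would be declared different at the 94% family-wise confidence level.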

(b) (2 marks): Develop a linear combination (contrast) to test whether there is a difference between using
alcohol versus the other three methods combined. However, you do not need to perform all steps of a
hypothesis test; just develop the contrast and calculate the estimate of the contrast.

Let $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ be the mean bacterial counts for just water, alcohol, anti-bacterial soap and regular soap, respectively.

$\gamma_{alcohol-others} = \frac{\mu_2}{1} - \frac{\mu_1 + \mu_3 + \mu_4}{3}$

$\gamma_{alcohol-others} = \mu_2 - \frac{1}{3}\mu_1 - \frac{1}{3}\mu_3 - \frac{1}{3}\mu_4$ [Sum of the coefficients = 0]

Estimate of the contrast is: $\hat{\gamma} = (37.5) - \frac{1}{3}(117) - \frac{1}{3}(92.5) - \frac{1}{3}(106) = -67.6667$

[Optionally, the coefficients could be: +3, -1, -1, -1 and the estimate of the contrast would be $\hat{\gamma} = -203$.
This would not make any difference to the t-statistic if a complete test was being performed.]
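
As a quick check of the contrast estimate, the sketch below applies both sets of coefficients to the group averages from the SUMMARY table. The function name `contrast_estimate` and the dictionary keys are illustrative assumptions.

```python
# Group averages taken from the SUMMARY table in the question.
means = {"water": 117.0, "alcohol": 37.5, "antibacterial": 92.5, "regular": 106.0}


def contrast_estimate(means, coeffs):
    """Estimate a linear combination of group means; coefficients must sum to 0."""
    if abs(sum(coeffs.values())) > 1e-9:
        raise ValueError("contrast coefficients must sum to zero")
    return sum(coeffs[group] * means[group] for group in means)


fractional = {"water": -1 / 3, "alcohol": 1.0, "antibacterial": -1 / 3, "regular": -1 / 3}
integer = {"water": -1.0, "alcohol": 3.0, "antibacterial": -1.0, "regular": -1.0}

est_fractional = contrast_estimate(means, fractional)
est_integer = contrast_estimate(means, integer)
print(round(est_fractional, 4), est_integer)
```

The integer version is exactly three times the fractional one, which is why the t-statistic (estimate divided by its standard error) would be unchanged.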

Question 3 (Four parts totaling 16 marks): The average saturated fat consumption (in grams) and
cholesterol level (in mg/100 ml of blood) of a random sample of 8 men were recorded. The data obtained
fit the assumptions of simple linear regression analysis. SPSS output obtained after analysis of the data is
shown below, with some values missing from the tables. You may also need some of the following
information: $\bar{x} = 52.625$, $\bar{y} = 189.250$ and $S_{xx} = 1587.875$.

[Figures: Scatterplot (done with SPSS); Normal Probability Plot (done with SPSS); Residual Plot of residuals versus fat consumption (done with Excel)]

Model Summary [b]

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .954 [a]   .910       .895                9.148

a. Predictors: (Constant), Fat_consumption
b. Dependent Variable: Cholesterol_level

ANOVA [a]

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   5089.335         1    5089.335      60.809   .000 [b]
Residual     502.165          6    83.694
Total        5591.500         7

a. Dependent Variable: Cholesterol_level
b. Predictors: (Constant), Fat_consumption

Coefficients [a]

                  Unstandardized Coefficients                        99.0% Confidence Interval for B
Model 1           B        Std. Error   t       Sig.       Lower Bound   Upper Bound
(Constant)        95.036   12.507       7.598   0.000270   48.666        141.406
Fat_consumption   1.790    .230         7.798   0.000234   .939          2.641

a. Dependent Variable: Cholesterol_level

Note: The numbers highlighted in yellow were not given in the question paper.

(a) (5 marks): Using a simple linear regression ANOVA test, at the 1% significance level, test whether
there is a relationship between saturated fat consumption and cholesterol level in men. In other words,
test for the significance of the slope of the regression line. Perform ALL steps of the hypothesis test.
Give both the exact P-value from the computer output and the P-value obtained from the F-table.

H0: β1 = 0 (There is no relationship between fat consumption and cholesterol level in men.)
Ha: β1 ≠ 0 (There is a relationship between fat consumption and cholesterol level in men.)
$F = \frac{MS_{REGR}}{MS_{ERROR}} = \frac{SS_{REGR}/(2-1)}{SS_{ERROR}/(n-2)} = \frac{5089.335/1}{502.165/(8-2)} = \frac{5089.335}{83.6942} = 60.809$

df = (1, n - 2) = (1, 6)
From SPSS output, the Exact P-value of the F-test = P-value of the two-tailed t-test = 0.000234
Examining the F-table, P < 0.001
Since P-value < α (0.01), reject H0. There is extremely strong evidence against H0.

At the 1% significance level, the data provide sufficient evidence to conclude that there is a relationship
between fat consumption and cholesterol level in men.
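
The F-statistic for this test can be reproduced from the sums of squares in the ANOVA table. The following is a minimal sketch; the function name `regression_f` is hypothetical, and k = 1 reflects the single predictor in simple linear regression.

```python
def regression_f(ss_regr, ss_error, n, k=1):
    """Return the regression F-statistic and its (numerator, denominator) df.

    k is the number of predictors; the error df is n - (k + 1).
    """
    df_error = n - (k + 1)
    ms_regr = ss_regr / k
    ms_error = ss_error / df_error
    return ms_regr / ms_error, (k, df_error)


f_stat, df = regression_f(ss_regr=5089.335, ss_error=502.165, n=8)
print(round(f_stat, 3), df)
```

The P-value itself still has to come from the F-table or from software; the sketch only covers the test statistic and degrees of freedom.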

(b) (3 marks): Calculate the Pearson correlation coefficient for the relationship between saturated fat
consumption and cholesterol level in men. You do not need to do all the steps of a hypothesis test;
just state the correlation coefficient, the P-value (both the exact P-value from the computer output and
the P-value obtained from the r-table) and your conclusion.

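Coefficient of determination: $R^2 = \frac{SS_{REGR}}{SS_{TOTAL}} = \frac{5089.335}{5591.500} = 0.91019$

Pearson correlation coefficient: $r = +\sqrt{0.91019} = 0.954$ (the positive root is taken because the estimated slope is positive)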

From SPSS output, P-value of correlation test = P-value of the two-tailed t-test = 0.000234
Examining the table for Pearson correlation coefficient at df = n – 2 = 6, P < 0.001

There is extremely strong evidence that there is correlation between saturated fat consumption and
cholesterol level in men.
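
A small Python sketch of the same calculation, assuming (as in simple linear regression) that r carries the sign of the estimated slope. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(ss_regr, ss_total, slope):
    """Return (r, R^2), giving r the sign of the fitted slope."""
    r_squared = ss_regr / ss_total
    r = math.copysign(math.sqrt(r_squared), slope)
    return r, r_squared


r, r_squared = pearson_r(ss_regr=5089.335, ss_total=5591.500, slope=1.790)
print(round(r, 3), round(r_squared, 5))
```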

(c) (4 marks): Since it is fairly common knowledge that high saturated fat consumption increases
cholesterol level, perform a regression t-test, at the 1% significance level, to test the hypothesis that
there is a positive relationship between saturated fat consumption and cholesterol level in men.
Perform ALL steps of the hypothesis test. Give both the exact P-value from the computer output and
also the P-value obtained from the t-table.

H0: β1 = 0 (There is no relationship between saturated fat consumption and cholesterol level in men.)
Ha: β1 > 0 (There is a positive relationship between saturated fat consumption and cholesterol level in
men.)

$t = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{S_{xx}}} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{1.790}{0.230} = 7.783$

[Note: There is a slight rounding error since SPSS output gives t = 7.798]

Since the slope is positive, the relationship is positive.

Exact P-value = 0.000234/2 = 0.000117


Using the t-table, at df = n – 2 = 6, P < 0.0005

Since P < α, reject Ho with extremely strong evidence

At the 1% significance level, the data provide sufficient evidence to conclude that there is a positive
relationship between saturated fat consumption and cholesterol level in men.
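
The rounding effect mentioned in the note can be seen by computing the standard error of the slope from $\hat{\sigma}$ and $S_{xx}$ instead of using the rounded value 0.230. This is a sketch with a hypothetical function name, `slope_t`; the two-sided P-value is the one printed in the SPSS output.

```python
import math


def slope_t(beta1_hat, sigma_hat, s_xx):
    """Return (t-statistic, standard error) for the slope in simple regression."""
    se = sigma_hat / math.sqrt(s_xx)
    return beta1_hat / se, se


t_stat, se = slope_t(beta1_hat=1.790, sigma_hat=9.148, s_xx=1587.875)

# One-sided P-value: half the two-sided value, valid here because the
# estimated slope lies in the direction stated in Ha (positive).
p_two_sided = 0.000234
p_one_sided = p_two_sided / 2

print(round(se, 4), round(t_stat, 3), p_one_sided)
```

With the less heavily rounded standard error, the t-statistic comes out near 7.797, much closer to the SPSS value of 7.798.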

(d) (4 marks): Calculate a 95% confidence interval for the mean response of cholesterol level for men
whose average fat consumption is 60 g/day.

At the 95% confidence level and df = 8 – 2 = 6, tα/2 = 2.447


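Given in the question: $\bar{x} = 52.625$ and $S_{xx} = 1587.875$

$\hat{y}_p = \hat{\beta}_0 + \hat{\beta}_1 x_p$ where $x_p = 60$ g/day

$\hat{y}_p = 95.036 + 1.790(60) = 202.436$

$\hat{y}_p \pm t_{\alpha/2} \times \hat{\sigma} \times \sqrt{\frac{1}{n} + \frac{(x_p - \sum x_i / n)^2}{S_{xx}}}$

$\hat{\sigma} = \sqrt{MS_{ERROR}} = \sqrt{\frac{SS_{ERROR}}{n-2}} = \sqrt{\frac{502.165}{8-2}} = \sqrt{83.694167} = 9.148$

$202.436 \pm 2.447 \times 9.148 \times \sqrt{\frac{1}{8} + \frac{(60 - 52.625)^2}{1587.875}}$

202.436 ± 2.447 × 9.148 × 0.399066
202.436 ± 8.9332
(193.50, 211.37)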

We can be 95% confident that the mean cholesterol level for men whose average fat consumption is 60
g/day is somewhere between 193.50 and 211.37 mg/100 ml of blood.
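
The interval can be verified with the sketch below. The function name `mean_response_ci` is hypothetical, and the critical value 2.447 is supplied from the t-table rather than computed.

```python
import math


def mean_response_ci(b0, b1, x_p, x_bar, s_xx, sigma_hat, n, t_crit):
    """Confidence interval for the mean response at x_p in simple regression."""
    y_hat = b0 + b1 * x_p
    se_fit = sigma_hat * math.sqrt(1 / n + (x_p - x_bar) ** 2 / s_xx)
    margin = t_crit * se_fit
    return y_hat, margin, (y_hat - margin, y_hat + margin)


y_hat, margin, (lower, upper) = mean_response_ci(
    b0=95.036, b1=1.790, x_p=60, x_bar=52.625,
    s_xx=1587.875, sigma_hat=9.148, n=8, t_crit=2.447,
)
print(round(y_hat, 3), round(margin, 4), (round(lower, 2), round(upper, 2)))
```

The interval is narrowest when $x_p$ equals $\bar{x}$, since the second term under the square root then vanishes.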

Question 4 (Two parts totaling 9 marks): An experiment was conducted to test the ultimate strength (in
MPa) of random samples of three types of metals (steel, alloy and titanium) produced by two methods
(Method 1 and Method 2). The following is incomplete SPSS output. [Note: This is a balanced design
where n = 42 and there are 7 replicates for each combination of the two factors.]

Tests of Between-Subjects Effects
Dependent Variable: Strength

Source            Type III Sum of Squares   df   Mean Square    F           Sig.
Corrected Model   40683.929 [a]             5    8136.786       5.477       .001
Intercept         31037405.357              1    31037405.357   20890.561   .000
Method            48.214                    1    48.214         .032        .858
Metal             30389.286                 2    15194.643      10.227      .000
Method * Metal    10246.429                 2    5123.214       3.448       .043
Error             53485.714                 36   1485.714
Total             31131575.000              42
Corrected Total   94169.643                 41

a. R Squared = .432 (Adjusted R Squared = .353)

Note: The numbers highlighted in yellow were not given in the question paper.

(a) (6 marks): Perform a hypothesis test, at the 1% significance level, to determine whether the overall
model is significant.

H0: All treatment means for method/metal combinations are equal. (The overall model is not useful.)
Ha: At least two means are different. (The overall model is useful.)

Overall SS (Corrected SS) = SSA + SSB + SSAB = 48.214 + 30389.286 + 10246.429 = 40683.929
(OR)
Overall SS (Corrected SS) = Corrected Total SS – Error SS = 94169.643 – 53485.714 = 40683.929

Overall model df = df(A) + df(B) + df(AB) = (a – 1) + (b – 1) + (a – 1)(b – 1) = 1 + 2 + 2 = 5


(OR)
Overall model df = ab – 1 = (2)(3) – 1 = 5

F (Overall model) = F (Corrected model)

$F = \frac{\text{Corrected SS} / \text{Corrected df}}{\text{Error SS} / \text{Error df}} = \frac{40683.929/5}{53485.714/(42 - 3 \times 2)} = \frac{8136.786}{1485.714} = 5.477$

df = [(ab − 1), (n − ab)] = [(3 × 2 − 1), (42 − 3 × 2)] = (5, 36)

P < 0.001. Since P < α (0.01), reject H0. There is extremely strong evidence against H0.

Conclusion: The data provide sufficient evidence that the treatment means are not all the same for
method/metal combinations, therefore the overall model is significant.
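
The overall-model F-statistic can be rebuilt from the individual rows of the table, as in this sketch. The function name `two_way_overall_f` is hypothetical; a and b are the numbers of levels of the two factors.

```python
def two_way_overall_f(ss_a, ss_b, ss_ab, ss_error, a, b, n):
    """Overall (corrected model) F-test for a balanced two-way ANOVA."""
    ss_model = ss_a + ss_b + ss_ab
    df_model = a * b - 1
    df_error = n - a * b
    f_stat = (ss_model / df_model) / (ss_error / df_error)
    return ss_model, (df_model, df_error), f_stat


ss_model, df, f_stat = two_way_overall_f(
    ss_a=48.214, ss_b=30389.286, ss_ab=10246.429,
    ss_error=53485.714, a=2, b=3, n=42,
)
print(round(ss_model, 3), df, round(f_stat, 3))
```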

(b) (2 marks): The table below shows the results of multiple comparisons for the difference in strength
between the three types of metals (Steel, Alloy and Titanium), both separately for Methods 1 and 2
and overall for the two methods combined. Firstly, construct a means comparison diagram
summarizing the results of multiple comparisons for the difference between the three types of metals,
regardless of the method used (that is, based on the means for the totals). Secondly, write a
conclusion in words about what the multiple comparisons show.

Descriptive Statistics
Dependent Variable: Strength

Method     Metal      Mean     Std. Deviation   N
Method 1   Alloy      820.00   40.415           7
Method 1   Steel      864.29   36.904           7
Method 1   Titanium   891.43   36.710           7
Method 1   Total      858.57   47.041           21
Method 2   Alloy      824.29   41.975           7
Method 2   Steel      903.57   39.761           7
Method 2   Titanium   854.29   35.051           7
Method 2   Total      860.71   49.932           21
Total      Alloy      822.14   39.648           14
Total      Steel      883.93   42.116           14
Total      Titanium   872.86   39.502           14
Total      Total      859.64   47.925           42

Multiple Comparisons (Combining the two methods)
Dependent Variable: Strength
Tukey HSD

                                                                     95% Confidence Interval
(I) Metal   (J) Metal   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
Alloy       Steel       -61.79*                 14.569       .000   -97.40        -26.18
Alloy       Titanium    -50.71*                 14.569       .004   -86.32        -15.10
Steel       Alloy       61.79*                  14.569       .000   26.18         97.40
Steel       Titanium    11.07                   14.569       .730   -24.54        46.68
Titanium    Alloy       50.71*                  14.569       .004   15.10         86.32
Titanium    Steel       -11.07                  14.569       .730   -46.68        24.54

Based on observed means.
The error term is Mean Square (Error) = 1485.714.
*. The mean difference is significant at the 0.05 level.

Means Comparisons Diagram

Alloy     Titanium   Steel
822.14    872.86     883.93
          ----------------

(Means joined by a line are not significantly different.)

We can be 95% confident that the mean strength of alloy differs from the mean strength of both steel and
titanium (alloy is lower in each case), but there is no significant difference in mean strength between steel and titanium.
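
The grouping in the diagram follows from which Tukey intervals contain zero. The sketch below reads the intervals off the Multiple Comparisons table and also recomputes the standard error of a difference from the mean square error; the variable and function names are illustrative assumptions.

```python
import math

# 95% Tukey intervals for (I - J), copied from the Multiple Comparisons table.
intervals = {
    ("Alloy", "Steel"): (-97.40, -26.18),
    ("Alloy", "Titanium"): (-86.32, -15.10),
    ("Steel", "Titanium"): (-24.54, 46.68),
}


def significant_pairs(intervals):
    """A pair differs significantly when its interval excludes zero."""
    return {pair: not (low <= 0 <= high) for pair, (low, high) in intervals.items()}


# Standard error of a difference between two metal means (14 observations each).
mse, n_per_metal = 1485.714, 14
se_diff = math.sqrt(mse * (1 / n_per_metal + 1 / n_per_metal))

flags = significant_pairs(intervals)
for pair, is_significant in flags.items():
    print(pair, "different" if is_significant else "not different")
print(round(se_diff, 3))
```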

Question 5 (Five parts totaling 15 marks): A marine ecologist wanted to examine the relationship
between water depth, light intensity, and diatom density (response variable). At 9 different depths in the
ocean, he recorded depth (in meters), light intensity (as percentage of the surface intensity) and diatom
density (in cells per milliliter of ocean water). The first table below shows the raw data recorded. Below that
is incomplete SPSS output of the data analysis.

Water depth (m)   Light intensity (% of surface intensity)   Interaction (depth × light)   Diatom density (cells/ml)
1                 95                                          95                            96
5                 84                                          420                           85
10                69                                          690                           77
20                53                                          1060                          60
30                44                                          1320                          52
40                37                                          1480                          45
60                22                                          1320                          28
80                14                                          1120                          18
100               5                                           500                           7

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .999 [a]   .999       .998                1.369

a. Predictors: (Constant), Interaction, Depth, Light_intensity

ANOVA [a]

Model 1      Sum of Squares   df   Mean Square   F          Sig.
Regression   7510.626         3    2503.542      1335.301   .000 [b]
Residual     9.374            5    1.875
Total        7520.000         8

a. Dependent Variable: Diatoms
b. Predictors: (Constant), Interaction, Depth, Light_intensity

Coefficients [a]

                  Unstandardized Coefficients   Standardized Coefficients
Model 1           B        Std. Error           Beta                        t        Sig.
(Constant)        35.208   14.431                                           2.440    .059
Depth             -.306    .118                 -.348                       -2.605   .048
Light_intensity   .639     .152                 .650                        4.193    .009
Interaction       -.002    .003                 -.026                       -.507    .634

a. Dependent Variable: Diatoms


Note: The numbers highlighted in yellow were not given in the question paper.

(a) (5 marks): At the 1% significance level, perform a hypothesis test to determine whether the overall
multiple regression model is significant or useful for making predictions about diatom density.

H0: β1 = β2 = β3 = 0
Ha: At least one βi is not zero

k = number of predictor variables = 3

n = 9 (representing the 9 depths where measurements were taken)

$F = \frac{MS_{REGR}}{MS_{ERROR}} = \frac{SS_{REGR}/k}{SS_{ERROR}/(n-(k+1))} = \frac{7510.626/3}{9.374/(9-(3+1))} = \frac{2503.542}{1.8748} = 1335.36$

[Note: There is a slight rounding error since SPSS output gives F = 1335.301]

df (regression) = k = 3
df (error) = n – (k + 1) = 9 – (3 + 1) = 5
At df = (3, 5), P < 0.001
Since P < α, reject Ho with extremely strong evidence.

Conclusion: At the 1% significance level, the data provide sufficient evidence to conclude that at least
one of the population regression coefficients is not zero OR that the overall regression model is useful for
making predictions about the response variable (diatom density).

(b) (3 marks): Calculate a 95% confidence interval for the slope of the interaction term (representing
interaction between depth and light intensity). Using this confidence interval, what conclusion can you
make about the significance of the slope of the interaction term? Explain your answer.

At df = 5, $t_{\alpha/2} = t_{0.05/2} = t_{0.025} = 2.571$

$\hat{\beta}_i \pm t_{\alpha/2} \times SE(\hat{\beta}_i)$
$-0.002 \pm 2.571 \times 0.003$
(−0.0097, 0.0057)

Conclusion: Since 0 is inside this interval, the slope of the interaction term is not significantly
different from zero at the 5% significance level.
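
A short sketch of the interval calculation and the "does it contain zero" check. The function name `slope_ci` is hypothetical and the critical value is taken from the t-table at df = 5.

```python
def slope_ci(estimate, se, t_crit):
    """Confidence interval for a regression coefficient."""
    margin = t_crit * se
    return estimate - margin, estimate + margin


lower, upper = slope_ci(estimate=-0.002, se=0.003, t_crit=2.571)
contains_zero = lower <= 0 <= upper
print(round(lower, 4), round(upper, 4), contains_zero)
```

An interval that contains zero agrees with the two-sided t-test for the same coefficient at the matching significance level (here P = .634 in the Coefficients table).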

(c) (1 mark): Find the standard error of the model (standard error of the estimate of the model).

$MS_{ERROR} = \frac{SS_{ERROR}}{n-(k+1)} = \frac{9.374}{5} = 1.8748$

Standard error of the model is: $\hat{\sigma} = \sqrt{MS_{ERROR}} = \sqrt{1.8748} = 1.369$

(d) (2 marks): At a depth of 70 meters and a light intensity of 18%, suppose that the observed
diatom density was 19.6 cells per milliliter. What was the residual or error of this
observation?

$\hat{y} = 35.208 + (-0.306)(70) + 0.639(18) + (-0.002)(70 \times 18) = 22.77$

Residual = e = Observed − Predicted = $y_i - \hat{y}$ = 19.6 − 22.77 = −3.17
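
The fitted value and residual can be reproduced from the rounded coefficients in the Coefficients table. In this sketch the function name `predict_diatoms` and its parameter names are assumptions made for illustration.

```python
def predict_diatoms(depth, light, b0=35.208, b_depth=-0.306,
                    b_light=0.639, b_interaction=-0.002):
    """Fitted diatom density from the interaction model (rounded SPSS coefficients)."""
    return b0 + b_depth * depth + b_light * light + b_interaction * depth * light


observed = 19.6
y_hat = predict_diatoms(depth=70, light=18)
residual = observed - y_hat
print(round(y_hat, 2), round(residual, 2))
```

A negative residual means the model over-predicted the diatom density at this depth and light intensity.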

(e) (4 marks): Based on the values of the predictor variables given in part (d) (depth = 70 m, light = 18%,
interaction term = 1260), what is the 95% prediction interval for all single observation responses of
diatom density at those values of the predictor variables? [Note: SE(Fit) = 0.793]

At df = 5, $t_{\alpha/2} = t_{0.05/2} = t_{0.025} = 2.571$

Based on the values of the predictor variables given in part (d), $\hat{y}_p = 22.77$

$\hat{y}_p \pm t_{\alpha/2} \times \sqrt{\hat{\sigma}^2 + [SE(Fit)]^2}$
$22.77 \pm 2.571 \times \sqrt{(1.369)^2 + (0.793)^2}$
22.77 ± 4.068
(18.70, 26.84)

We can be 95% confident that the diatom density (any single observation) at the values of the predictor
variables given in part (d) is between 18.70 and 26.84 cells per ml.
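
The prediction interval combines the model's standard error with SE(Fit), as in this sketch. The function name `prediction_interval` is hypothetical; all inputs are the values quoted in the solution.

```python
import math


def prediction_interval(y_hat, sigma_hat, se_fit, t_crit):
    """Prediction interval for a single new observation."""
    margin = t_crit * math.sqrt(sigma_hat ** 2 + se_fit ** 2)
    return margin, (y_hat - margin, y_hat + margin)


margin, (lower, upper) = prediction_interval(
    y_hat=22.77, sigma_hat=1.369, se_fit=0.793, t_crit=2.571
)
print(round(margin, 3), (round(lower, 2), round(upper, 2)))
```

Dropping the `sigma_hat ** 2` term would give the narrower confidence interval for the mean response instead of the prediction interval.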

Question 6 (5 marks): A certain company wanted to analyze the relationship between total sales
(response variable) and the money they spend advertising through magazines, television, and radio. All
data were recorded in millions of dollars based on a random sample of 10 business transactions. At the
5% significance level, perform the most appropriate test, showing all steps, to determine whether
magazines have any effect on sales after accounting for the effect of TV and radio advertising.

Consider the following three models and the corresponding ANOVA tables below them:

Model 1: $\mu\{Sales \mid magazines\} = \beta_0 + \beta_1\,magazines$

Model 2: $\mu\{Sales \mid TV, radio\} = \beta_0 + \beta_2\,TV + \beta_3\,radio$

Model 3: $\mu\{Sales \mid magazines, TV, radio\} = \beta_0 + \beta_1\,magazines + \beta_2\,TV + \beta_3\,radio$

ANOVA table for Model 1: Effect of magazines

ANOVA [a]
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   353.361          1    353.361       2.950   .124 [b]
Residual     958.275          8    119.784
Total        1311.636         9
a. Dependent Variable: Sales
b. Predictors: (Constant), Magazines

ANOVA table for Model 2: Effect of Radio and TV

ANOVA [a]
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   1117.732         2    558.866       20.175   .001 [b]
Residual     193.904          7    27.701
Total        1311.636         9
a. Dependent Variable: Sales
b. Predictors: (Constant), Radio, TV

ANOVA table for Model 3: Effect of Magazines, Radio and TV (Full Model)

ANOVA [a]
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   1194.529         3    398.176       20.401   .002 [b]
Residual     117.107          6    19.518
Total        1311.636         9
a. Dependent Variable: Sales
b. Predictors: (Constant), Radio, TV, Magazines

Note: The numbers highlighted in yellow were not given in the question paper.

Solution for Question 6:
$\beta_1$ = slope for magazines
H0: $\beta_1 = 0$ (Reduced model: Model 2: $\mu\{Sales \mid TV, radio\} = \beta_0 + \beta_2\,TV + \beta_3\,radio$)
Ha: $\beta_1 \neq 0$ (Full model: Model 3: $\mu\{Sales \mid magazines, TV, radio\} = \beta_0 + \beta_1\,magazines + \beta_2\,TV + \beta_3\,radio$)

$df_E(reduced) = n - (k + 1) = 10 - (2 + 1) = 7$
$df_E(full) = n - (k + 1) = 10 - (3 + 1) = 6$

$F = \frac{[SS_E(reduced) - SS_E(full)] / [df_E(reduced) - df_E(full)]}{SS_E(full) / df_E(full)} = \frac{[193.904 - 117.107] / [7 - 6]}{117.107/6} = \frac{76.797/1}{117.107/6} = \frac{76.797}{19.51783} = 3.935$

df = (Extra df, n − (k + 1)) = (1, 10 − (3 + 1)) = (1, 6)
Thus, 0.10 > P > 0.05, which provides moderate evidence against the null hypothesis
Since P > α (0.05), do not reject Ho

Conclusion: At the 5% significance level, the data do not provide sufficient evidence to conclude that
magazines have an effect on sales after accounting for the effect of TV and radio advertising.
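
The extra-sum-of-squares F-test above can be sketched as follows. The function name `extra_ss_f` is hypothetical, and the two critical values are the standard F-table entries for df = (1, 6), entered by hand rather than computed.

```python
def extra_ss_f(sse_reduced, df_reduced, sse_full, df_full):
    """Extra-sum-of-squares F-test comparing a reduced model with a full model."""
    extra_df = df_reduced - df_full
    f_stat = ((sse_reduced - sse_full) / extra_df) / (sse_full / df_full)
    return f_stat, (extra_df, df_full)


f_stat, df = extra_ss_f(sse_reduced=193.904, df_reduced=7,
                        sse_full=117.107, df_full=6)

# F-table critical values at df = (1, 6): upper 10% and upper 5% points.
F_CRIT_10, F_CRIT_05 = 3.78, 5.99
p_between_05_and_10 = F_CRIT_10 < f_stat < F_CRIT_05

print(round(f_stat, 3), df, p_between_05_and_10)
```

Because the statistic falls between the two critical values, the P-value lies between 0.05 and 0.10, which matches the bracket given in the solution.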

Question 7 (Two parts totaling 3 marks): Consider the following model:

$\mu\{Sales \mid magazines, TV, radio\} = \beta_0 + \beta_1\,magazines + \beta_2\,TV + \beta_3\,radio + \beta_4 (magazines \times TV) + \beta_5 (magazines \times radio) + \beta_6 (TV \times radio) + \beta_7 (magazines \times TV \times radio)$

(a) (2 marks): What is the effect of magazines on sales? How would you redefine the model? No
calculations are necessary.

$\mu\{Sales \mid magazines + 1, TV, radio\} - \mu\{Sales \mid magazines, TV, radio\} = \beta_1 + \beta_4 (TV) + \beta_5 (radio) + \beta_7 (TV \times radio)$
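
The identity in part (a) can be checked numerically: increasing magazines by one unit changes the model mean by exactly $\beta_1 + \beta_4 TV + \beta_5 radio + \beta_7 (TV \times radio)$. The coefficient values and spending levels below are hypothetical, chosen only to illustrate the algebra; they are not estimates from the exam data.

```python
def mean_sales(mag, tv, radio, b):
    """Mean sales under the full interaction model; b holds beta_0 .. beta_7."""
    return (b[0] + b[1] * mag + b[2] * tv + b[3] * radio
            + b[4] * mag * tv + b[5] * mag * radio
            + b[6] * tv * radio + b[7] * mag * tv * radio)


def magazine_effect(tv, radio, b):
    """Change in mean sales per one-unit increase in magazines."""
    return b[1] + b[4] * tv + b[5] * radio + b[7] * tv * radio


# Hypothetical coefficients beta_0 .. beta_7 (illustration only).
b = [10.0, 2.0, 3.0, 1.5, 0.5, -0.25, 0.1, 0.05]

difference = mean_sales(4, 2, 3, b) - mean_sales(3, 2, 3, b)
effect = magazine_effect(2, 3, b)
print(round(difference, 6), round(effect, 6))
```

The effect depends on the TV and radio levels, which is why no single number summarizes the effect of magazines in this model.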

(b) (1 mark): What would be the null hypothesis for testing for the effect of magazines?

$H_0: \beta_1 = \beta_4 = \beta_5 = \beta_7 = 0$

Question 8 (2 marks): Suppose the relationship between the annual rate of hip fractures (per 100,000
people) and age follows the following model: $\hat{\mu}\{\ln(fractures) \mid age\} = -2.09 + 0.0912\,age$.
For an increase in age from 40 to 50 years old, what would be your interpretation regarding the rate of hip
fractures on the original scale?

An additive change of 50 – 40 = 10 years in age is associated with a multiplicative change of
$e^{10\hat{\beta}_1} = e^{(10)(0.0912)} = 2.489$ in the median of the annual rate of hip fractures.

In other words, the median rate of hip fractures (per 100,000 people) at 50 years will be 2.489 times the
median rate of hip fractures at 40 years.
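
The back-transformation can be checked in a couple of lines. The function name `multiplicative_change` is hypothetical; the slope is the one given in the question.

```python
import math


def multiplicative_change(slope, delta_x):
    """Factor by which the median response changes when x increases by delta_x,
    for a model fitted to the natural log of the response."""
    return math.exp(slope * delta_x)


ratio = multiplicative_change(slope=0.0912, delta_x=50 - 40)
print(round(ratio, 3))
```

The intercept (−2.09) cancels out of the ratio, so only the slope and the size of the age change matter.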

Question 9 (6 marks): All parametric hypothesis tests have assumptions about normality. However, the
specific requirements regarding normality differ from one test to the other. For each of the hypothesis
tests mentioned below, explain what the specific requirement is regarding normality. In answering this
question, do not make reference to the Central Limit Theorem, assuming that sample sizes are not large
enough to apply that theorem.

1. Two-sample t-test (independent samples)


2. Paired-sample t-test
3. One-way ANOVA
4. Two-way ANOVA
5. Simple linear regression
6. Multiple linear regression

Solution:
1. Two-sample t-test (independent samples) – The two populations being compared must be normally
distributed.
2. Paired-sample t-test – The differences between paired observations must be normally distributed.
3. One-way ANOVA – All populations being compared must be normally distributed.
4. Two-way ANOVA – For each combination of treatments, the response variable must be normally
distributed.
5. Simple linear regression – For each value of the predictor variable, the corresponding values of the
response variable must be normally distributed.
6. Multiple linear regression – For each set of values of the predictor variables in the model, the
corresponding values of the response variable must be normally distributed.

Question 10 (2 marks): There are two types of inferential statistics that can be applied to a research
problem; one type is a hypothesis test and the other is a confidence interval. What advantage does a
hypothesis test have over calculating a confidence interval? Explain your answer.

Solution:
A hypothesis test provides the strength of the evidence against the null hypothesis, based on a P-value,
which is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis
were true. On the other hand, a confidence interval does not provide the strength of the evidence against the
null hypothesis; it can only provide a decision about whether or not to reject the null hypothesis, at a
specific level of confidence.
