Stat 252-Practice Final-Greg-Solutions
STAT 252
PRACTICE FINAL
SOLUTIONS
Signature: __________________________________
2. You are permitted to use a NON-PROGRAMMABLE calculator, and the formula sheets and tables
provided.
6. This exam has 14 pages (including this cover and all computer output tables). Please ensure that
you have all pages.
7. Make sure your name and signature are on the front and your student ID number is at the top of
page two.
8. For questions that state you should show all steps, be sure that you do this in order to obtain full
credit. Conclusions must also be clearly stated. Your answers must have adequate justification.
9. For questions that state that you do not need to show all steps, read the question carefully and follow
the exact instructions regarding what is required.
10. If you run out of space in the blank area provided, use the back of the page to complete your answers
as needed and label such answers so that it is clear which question they belong to.
11. Also use the reverse sides of the pages for all rough work.
BEST WISHES!!
Student ID Number: ___________________
Question 1 (2 marks): What is an indicator or dummy variable? What is its application in regression?
An indicator variable is a categorical variable that has been turned into a quantitative variable so that it
can be included in a regression model. This is done by coding the levels of the categorical variable with a
0 and 1 (if there are only two levels) or with a combination of 0’s and 1’s (if there are more than two
levels). When applied in regression, dummy variables can be used to test whether categorical variables
(represented by the indicator variables) are useful in making predictions about the response variable.
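The coding described above can be sketched in Python. This is a minimal illustration and not part of the original exam; the function name `dummy_code` and the choice of the first level as the reference (all-zero) category are assumptions made for the example, and the metal types are borrowed from Question 4.

```python
def dummy_code(values, levels):
    """Code a categorical variable as 0/1 indicator columns.

    The first entry of `levels` is treated as the reference level and
    gets no column of its own, so k levels yield k - 1 indicators.
    """
    indicator_levels = levels[1:]
    return [
        {f"is_{level}": int(value == level) for level in indicator_levels}
        for value in values
    ]


metals = ["Alloy", "Steel", "Titanium", "Steel"]
coded = dummy_code(metals, levels=["Alloy", "Steel", "Titanium"])
for metal, row in zip(metals, coded):
    print(metal, row)
```

With three levels, two indicators are enough: the reference level (here Alloy) is the row in which both indicators are 0.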
Question 2 (Two parts totaling 5 marks): A randomized experiment was conducted on washing hands
using four different methods and determining subsequent bacterial counts. The output below is from an
ANOVA F-test which resulted in rejecting the null hypothesis and concluding that there is a difference in
the bacterial counts after using the four methods of washing hands.
SUMMARY
Groups                Count   Sum   Average   Variance
Just water            8       936   117       969.1429
Alcohol (65%)         8       300   37.5      705.4286
Anti-bacterial soap   8       740   92.5      1760.857
Regular soap          8       848   106       2205.143

ANOVA
Source of Variation   SS      df   MS         F       P-value
Between Groups        29882   3    9960.667   7.064   0.0011
Within Groups         39484   28   1410.143
Total                 69366   31
(a) (3 marks): Using the Bonferroni method at the 94% confidence level, determine the individual
comparison-wise error rate ($\alpha_I$), find the critical value (two-sided) from the appropriate statistical
table and calculate the margin of error. You do not need to perform all the steps. [Note: You only
have to calculate the margin of error once since the sample sizes are equal for all treatment groups.]
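The worked solution to part (a) is not reproduced in this copy. As a hedged sketch of the arithmetic, assuming that all k(k − 1)/2 = 6 pairwise comparisons are of interest and taking the two-sided critical value t = 2.763 (upper-tail area 0.005, df = 28) from a standard t-table rather than computing it, the steps could be checked as follows. The variable names are illustrative.

```python
import math

k = 4                # number of treatment groups
n_per_group = 8      # equal sample size in every group
mse = 1410.143       # Within Groups MS from the ANOVA table (df = 28)

overall_alpha = 1 - 0.94                    # 94% family-wise confidence
n_comparisons = k * (k - 1) // 2            # assumption: all pairwise comparisons
alpha_individual = overall_alpha / n_comparisons

# Assumed table lookup: upper-tail area alpha_individual / 2 = 0.005, df = 28.
t_critical = 2.763

se_difference = math.sqrt(mse * (1 / n_per_group + 1 / n_per_group))
margin_of_error = t_critical * se_difference

print(f"alpha_I = {alpha_individual:.4f}")
print(f"SE of a difference = {se_difference:.3f}")
print(f"margin of error = {margin_of_error:.2f}")
```

Under these assumptions the individual error rate is 0.06 / 6 = 0.01 and the margin of error is about 51.88.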
(b) (2 marks): Develop a linear combination (contrast) to test whether there is a difference between using
alcohol versus the other three methods combined. However, you do not need to perform all steps of a
hypothesis test; just develop the contrast and calculate the estimate of the contrast.
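Numbering the groups in the order of the output (1 = just water, 2 = alcohol, 3 = anti-bacterial soap, 4 = regular soap), the contrast is:
$$\gamma_{\text{alcohol} - \text{others}} = \frac{\mu_2}{1} - \frac{\mu_1 + \mu_3 + \mu_4}{3}$$
$$\gamma_{\text{alcohol} - \text{others}} = \mu_2 - \tfrac{1}{3}\mu_1 - \tfrac{1}{3}\mu_3 - \tfrac{1}{3}\mu_4 \quad \text{[Sum of the coefficients = 0]}$$
Estimate of the contrast is: $\hat{\gamma} = (37.5) - \tfrac{1}{3}(117) - \tfrac{1}{3}(92.5) - \tfrac{1}{3}(106) = -67.6667$
[Optionally, the coefficients could be: +3, -1, -1, -1 and the estimate of the contrast would be $\hat{\gamma} = -203$.
This would not make any difference to the t-statistic if a complete test was being performed.]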
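The contrast estimate can be verified with a short Python sketch. The dictionary layout and variable names are illustrative assumptions; the group means come from the SUMMARY table.

```python
means = {
    "Just water": 117.0,
    "Alcohol (65%)": 37.5,
    "Anti-bacterial soap": 92.5,
    "Regular soap": 106.0,
}

# Alcohol versus the average of the other three methods.
coefficients = {
    "Just water": -1 / 3,
    "Alcohol (65%)": 1.0,
    "Anti-bacterial soap": -1 / 3,
    "Regular soap": -1 / 3,
}

estimate = sum(coefficients[g] * means[g] for g in means)

# Equivalent integer coefficients (+3, -1, -1, -1) scale the estimate by 3.
scaled_estimate = sum(3 * coefficients[g] * means[g] for g in means)

print(f"sum of coefficients = {sum(coefficients.values()):.6f}")
print(f"estimate = {estimate:.4f}")
print(f"scaled estimate = {scaled_estimate:.1f}")
```

Scaling the coefficients scales the estimate and its standard error by the same factor, which is why the t-statistic is unchanged.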
Question 3 (Four parts totaling 16 marks): The average saturated fat consumption (in grams) and
cholesterol level (in mg/100 ml of blood) of a random sample of 8 men were recorded. The data obtained
fit the assumptions of simple linear regression analysis. SPSS output obtained after analysis of the data is
shown below, with some values missing from the tables. You may also need some of the following
information: $\bar{x} = 52.625$, $\bar{y} = 189.250$ and $S_{xx} = 1587.875$.
[Figures: Scatterplot (done with SPSS); Normal Probability Plot (done with SPSS); Residual Plot of residuals versus fat consumption]
[SPSS output tables: Model Summary; ANOVA (Total: SS = 5591.500, df = 7); Coefficients]
Note: The numbers highlighted in yellow were not given in the question paper.
(a) (5 marks): Using a simple linear regression ANOVA test, at the 1% significance level, test whether
there is a relationship between saturated fat consumption and cholesterol level in men. In other words,
test for the significance of the slope of the regression line. Perform ALL steps of the hypothesis test.
Give both the exact P-value from the computer output and the P-value obtained from the F-table.
H0: β1 = 0 (There is no relationship between fat consumption and cholesterol level in men.)
Ha: β1 ≠ 0 (There is a relationship between fat consumption and cholesterol level in men.)
$$F = \frac{MS_{REGR}}{MS_{ERROR}} = \frac{SS_{REGR}/(2-1)}{SS_{ERROR}/(n-2)} = \frac{5089.335/1}{502.165/(8-2)} = \frac{5089.335}{83.6942} = 60.809$$
df = (1, n - 2) = (1, 6)
From SPSS output, the Exact P-value of the F-test = P-value of the two-tailed t-test = 0.000234
Examining the F-table, P < 0.001
Since P-value < α, reject H0. So, there is extremely strong evidence against H0.
At the 1% significance level, the data provide sufficient evidence to conclude that there is a relationship
between fat consumption and cholesterol level in men.
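The F-statistic above can be reproduced from the sums of squares. The following is a minimal Python sketch; the variable names are illustrative, and the error sum of squares is obtained as Total minus Regression.

```python
ss_regression = 5089.335
ss_total = 5591.500
n = 8

ss_error = ss_total - ss_regression      # 502.165
ms_regression = ss_regression / (2 - 1)  # one predictor, so df = 1
ms_error = ss_error / (n - 2)

f_statistic = ms_regression / ms_error

print(f"SS_ERROR = {ss_error:.3f}")
print(f"MS_ERROR = {ms_error:.4f}")
print(f"F = {f_statistic:.3f}")
```

In simple linear regression F equals the square of the slope t-statistic, and the SPSS value t = 7.798 squared is about 60.81, which agrees with this result.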
(b) (3 marks): Calculate the Pearson correlation coefficient for the relationship between saturated fat
consumption and cholesterol level in men. You do not need to do all the steps of a hypothesis test;
just state the correlation coefficient, the P-value (both the exact P-value from the computer output and
the P-value obtained from the r-table) and your conclusion.
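Coefficient of determination: $R^2 = \frac{SS_{REGR}}{SS_{TOTAL}} = \frac{5089.335}{5591.500} = 0.91019$
Pearson correlation coefficient: $r = +\sqrt{R^2} = \sqrt{0.91019} = 0.954$ (positive, because the estimated slope is positive)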
From SPSS output, P-value of correlation test = P-value of the two-tailed t-test = 0.000234
Examining the table for Pearson correlation coefficient at df = n – 2 = 6, P < 0.001
There is extremely strong evidence that there is correlation between saturated fat consumption and
cholesterol level in men.
(c) (4 marks): Since it is fairly common knowledge that high saturated fat consumption increases
cholesterol level, perform a regression t-test, at the 1% significance level, to test the hypothesis that
there is a positive relationship between saturated fat consumption and cholesterol level in men.
Perform ALL steps of the hypothesis test. Give both the exact P-value from the computer output and
also the P-value obtained from the t-table.
H0: β1 = 0 (There is no relationship between saturated fat consumption and cholesterol level in men.)
Ha: β1 > 0 (There is a positive relationship between saturated fat consumption and cholesterol level in
men.)
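$$t = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{S_{xx}}} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{1.790}{0.230} = 7.783$$
[Note: There is a slight rounding error since SPSS output gives t = 7.798]
df = n − 2 = 6
Because the alternative is one-sided and the estimated slope is positive, the exact P-value is half of the two-tailed value in the SPSS output: 0.000234 / 2 = 0.000117.
Since P-value < α (0.01), reject H0.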
At the 1% significance level, the data provide sufficient evidence to conclude that there is a positive
relationship between saturated fat consumption and cholesterol level in men.
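As a rough check on the rounding note in this solution, the standard error of the slope can be recomputed from $\hat{\sigma}$ and $S_{xx}$ instead of using the rounded 0.230. This Python sketch uses only values quoted in Question 3; the variable names are illustrative.

```python
import math

ss_error = 502.165   # error sum of squares from part (a)
n = 8
s_xx = 1587.875
slope = 1.790        # estimated slope, as rounded in the solution

sigma_hat = math.sqrt(ss_error / (n - 2))
se_slope = sigma_hat / math.sqrt(s_xx)

t_rounded = slope / 0.230        # using the rounded standard error
t_unrounded = slope / se_slope   # carrying the standard error at full precision

print(f"SE(slope) = {se_slope:.4f}")
print(f"t (rounded SE) = {t_rounded:.3f}")
print(f"t (unrounded SE) = {t_unrounded:.3f}")
```

Carrying more digits in the standard error moves t closer to the SPSS value of 7.798.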
(d) (4 marks): Calculate a 95% confidence interval for the mean response of cholesterol level for men
whose average fat consumption is 60 g/day.
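$$\hat{y}_p \pm t_{\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{xx}}}, \qquad \bar{x} = \frac{\sum x_i}{n}$$
$$\hat{\sigma} = \sqrt{MS_{ERROR}} = \sqrt{\frac{SS_{ERROR}}{n-2}} = \sqrt{\frac{502.165}{8-2}} = \sqrt{83.694167} = 9.148$$
$$202.436 \pm 2.447 \times 9.148 \times \sqrt{\frac{1}{8} + \frac{(60 - 52.625)^2}{1587.875}}$$
$$202.436 \pm 2.447 \times 9.148 \times 0.399066$$
$$202.436 \pm 8.9332$$
$$(193.50,\ 211.37)$$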
We can be 95% confident that the mean cholesterol level for men whose average fat consumption is 60
g/day is somewhere between 193.50 and 211.37 mg/100 ml of blood.
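The interval for the mean response can be recomputed with a few lines of Python. This is a sketch under the solution's own inputs: the fitted value 202.436 and the critical value 2.447 are taken from the worked answer rather than derived here, and the variable names are illustrative.

```python
import math

n = 8
x_bar = 52.625
s_xx = 1587.875
ss_error = 502.165
x_p = 60.0
y_hat_p = 202.436    # fitted value used in the solution
t_critical = 2.447   # t-table value used in the solution (df = 6)

sigma_hat = math.sqrt(ss_error / (n - 2))
se_mean = sigma_hat * math.sqrt(1 / n + (x_p - x_bar) ** 2 / s_xx)
margin = t_critical * se_mean
lower, upper = y_hat_p - margin, y_hat_p + margin

print(f"sigma_hat = {sigma_hat:.3f}")
print(f"margin = {margin:.4f}")
print(f"95% CI for the mean response: ({lower:.2f}, {upper:.2f})")
```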
Question 4 (Two parts totaling 9 marks): An experiment was conducted to test the ultimate strength (in
MPa) of random samples of three types of metals (steel, alloy and titanium) produced by two methods
(Method 1 and Method 2). The following is incomplete SPSS output. [Note: This is a balanced design
where n = 42 and there are 7 replicates for each combination of the two factors.]
Note: The numbers highlighted in yellow were not given in the question paper.
(a) (6 marks): Perform a hypothesis test, at the 1% significance level, to determine whether the overall
model is significant.
H0: All treatment means for method/metal combinations are equal. (The overall model is not useful.)
Ha: At least two means are different. (The overall model is useful.)
Overall SS (Corrected SS) = SSA + SSB + SSAB = 48.214 + 30389.286 + 10246.429 = 40683.929
(OR)
Overall SS (Corrected SS) = Corrected Total SS – Error SS = 94169.643 – 53485.714 = 40683.929
P < 0.001. Since P < α (0.01), reject H0. So, there is extremely strong evidence against H0.
Conclusion: The data provide sufficient evidence that the treatment means are not all the same for
method/metal combinations, therefore the overall model is significant.
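The F-statistic for the overall model is not reproduced in this copy. A hedged Python sketch of how it could be obtained from the sums of squares above follows, assuming 2 × 3 = 6 treatment combinations so that the model has 6 − 1 = 5 degrees of freedom and the error has 42 − 6 = 36. The variable names are illustrative.

```python
ss_a, ss_b, ss_ab = 48.214, 30389.286, 10246.429
ss_error = 53485.714
n_total = 42
n_cells = 2 * 3                 # methods x metals

ss_model = ss_a + ss_b + ss_ab
df_model = n_cells - 1          # 5
df_error = n_total - n_cells    # 36

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_statistic = ms_model / ms_error

print(f"SS_model = {ss_model:.3f}")
print(f"F({df_model}, {df_error}) = {f_statistic:.3f}")
```

Under these assumptions F is about 5.48 on (5, 36) degrees of freedom.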
(b) (2 marks): The table below shows the results of multiple comparisons for the difference in strength
between the three types of metals (Steel, Alloy and Titanium), both separately for Methods 1 and 2
and overall for the two methods combined. Firstly, construct a means comparison diagram
summarizing the results of multiple comparisons for the difference between the three types of metals,
regardless of the method used (that is, based on the means for the totals). Secondly, write a
conclusion in words about what the multiple comparisons show.
Descriptive Statistics
Dependent Variable: Strength
Method     Metal      Mean     Std. Deviation   N
Method 1   Alloy      820.00   40.415           7
Method 1   Steel      864.29   36.904           7
Method 1   Titanium   891.43   36.710           7
Method 1   Total      858.57   47.041           21
Method 2   Alloy      824.29   41.975           7
Method 2   Steel      903.57   39.761           7
Method 2   Titanium   854.29   35.051           7
Method 2   Total      860.71   49.932           21
Total      Alloy      822.14   39.648           14
Total      Steel      883.93   42.116           14
Total      Titanium   872.86   39.502           14
Total      Total      859.64   47.925           42
We can be 95% confident that there is a difference in mean strength between alloy and both steel and
titanium, but there is no significant difference in mean strength between steel and titanium.
Question 5 (Five parts totaling 15 marks): A marine ecologist wanted to examine the relationship
between water depth, light intensity, and diatom density (response variable). At 9 different depths in the
ocean, he recorded depth (in meters), light intensity (as percentage of the surface intensity) and diatom
density (in cells per milliliter of ocean water). The first table below shows the raw data recorded. Below that
is incomplete SPSS output of the data analysis.
[SPSS output tables: Model Summary; ANOVA (Total: SS = 7520.000, df = 8); Coefficients]
(a) (5 marks): At the 1% significance level, perform a hypothesis test to determine whether the overall
multiple regression model is significant or useful for making predictions about diatom density.
df (regression) = k = 3
df (error) = n – (k + 1) = 9 – (3 + 1) = 5
At df = (3, 5), P < 0.001
Since P < α, reject H0 with extremely strong evidence.
Conclusion: At the 1% significance level, the data provide sufficient evidence to conclude that at least
one of the population regression coefficients is not zero OR that the overall regression model is useful for
making predictions about the response variable (diatom density).
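The F-statistic for this test is not shown in this copy. As a hedged sketch, it could be reconstructed in Python from the Total sum of squares (7520.000) and the error sum of squares (9.374) quoted in part (c). The variable names are illustrative.

```python
ss_total = 7520.000
ss_error = 9.374     # value quoted in part (c)
n, k = 9, 3          # observations and predictors (depth, light, interaction)

ss_regression = ss_total - ss_error
df_regression = k
df_error = n - (k + 1)

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_statistic = ms_regression / ms_error

print(f"SS_REGR = {ss_regression:.3f}")
print(f"MS_ERROR = {ms_error:.4f}")
print(f"F({df_regression}, {df_error}) = {f_statistic:.1f}")
```

With these inputs F is roughly 1335 on (3, 5) degrees of freedom.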
(b) (3 marks): Calculate a 95% confidence interval for the slope of the interaction term (representing
interaction between depth and light intensity). Using this confidence interval, what conclusion can you
make about the significance of the slope of the interaction term? Explain your answer.
$$\hat{\beta}_i \pm t_{\alpha/2}\, SE(\hat{\beta}_i)$$
$$-0.002 \pm 2.571 \times 0.003$$
$$(-0.0097,\ 0.0057)$$
Conclusion: Since 0 is inside this interval, we can be 95% confident that the slope of the interaction term
is not significantly different from zero.
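The interval and the check for whether it contains zero can be sketched in Python. The estimate, standard error, and critical value are those used in the solution; the variable names are illustrative.

```python
estimate = -0.002     # slope of the interaction term
std_error = 0.003
t_critical = 2.571    # t-table value used in the solution (df = 5)

margin = t_critical * std_error
lower, upper = estimate - margin, estimate + margin
contains_zero = lower <= 0 <= upper

print(f"95% CI: ({lower:.4f}, {upper:.4f})")
print(f"interval contains 0: {contains_zero}")
```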
(c) (1 mark): Find the standard error of the model (standard error of the estimate of the model)?
$$MS_{ERROR} = \frac{SS_{ERROR}}{n - (k + 1)} = \frac{9.374}{5} = 1.8748$$
Standard error of the model is: $\hat{\sigma} = \sqrt{MS_{ERROR}} = \sqrt{1.8748} = 1.369$
(d) (2 marks): At a depth of 70 meters and a light intensity of 18%, suppose that the observed
diatom density was 19.6 cells per milliliter. What was the residual or error of this
observation?
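The worked answer to part (d) is not reproduced in this copy. A minimal sketch of the calculation, assuming the fitted value of 22.77 cells per ml that appears in the prediction interval of part (e), would be:

```python
observed = 19.6      # observed diatom density (cells per ml)
predicted = 22.77    # fitted value, taken from part (e)

# Residual = observed response minus predicted response.
residual = observed - predicted

print(f"residual = {residual:.2f} cells per ml")
```

A negative residual means the observation lies below the fitted regression surface.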
(e) (4 marks): Based on the values of the predictor variables given in part (d) (depth = 70 m, light = 18%,
interaction term = 1260), what is the 95% prediction interval for all single observation responses of
diatom density at those values of the predictor variables? [Note: SE(Fit) = 0.793]
$$\hat{y}_p \pm t_{\alpha/2}\sqrt{\hat{\sigma}^2 + [SE(Fit)]^2}$$
$$22.77 \pm 2.571\sqrt{(1.369)^2 + (0.793)^2}$$
$$22.77 \pm 4.068$$
$$(18.70,\ 26.84)$$
We can be 95% confident that the diatom density (any single observation) at the values of the predictor
variables given in part (d) is between 18.70 and 26.84 cells per ml.
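The prediction interval can be recomputed with a short Python sketch. All inputs are the values used in the solution; the variable names are illustrative.

```python
import math

y_hat_p = 22.77      # fitted value at depth = 70 m, light = 18%
t_critical = 2.571   # t-table value used in the solution (df = 5)
sigma_hat = 1.369    # standard error of the model from part (c)
se_fit = 0.793       # SE(Fit) given in the question

# A prediction interval adds the variance of a single observation
# to the variance of the fitted mean.
margin = t_critical * math.sqrt(sigma_hat ** 2 + se_fit ** 2)
lower, upper = y_hat_p - margin, y_hat_p + margin

print(f"margin = {margin:.3f}")
print(f"95% PI: ({lower:.2f}, {upper:.2f})")
```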
Question 6 (5 marks): A certain company wanted to analyze the relationship between total sales
(response variable) and the money they spend advertising through magazines, television, and radio. All
data were recorded in millions of dollars based on a random sample of 10 business transactions. At the
5% significance level, perform the most appropriate test, showing all steps, to determine whether
magazines have any effect on sales after accounting for the effect of TV and radio advertising.
Consider the following three models and the corresponding ANOVA tables below them:
ANOVA table for Model 1: Effect of Magazines
ANOVA^a
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   353.361          1    353.361       2.950   .124^b
  Residual     958.275          8    119.784
  Total        1311.636         9
a. Dependent Variable: Sales
b. Predictors: (Constant), Magazines
ANOVA table for Model 2: Effect of Radio and TV (Reduced Model)
ANOVA^a
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   1117.732         2    558.866       20.175   .001^b
  Residual     193.904          7    27.701
  Total        1311.636         9
a. Dependent Variable: Sales
b. Predictors: (Constant), Radio, TV
ANOVA table for Model 3: Effect of Magazines, Radio and TV (Full Model)
ANOVA^a
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   1194.529         3    398.176       20.401   .002^b
  Residual     117.107          6    19.518
  Total        1311.636         9
a. Dependent Variable: Sales
b. Predictors: (Constant), Radio, TV, Magazines
Note: The numbers highlighted in yellow were not given in the question paper.
Solution for Question 6:
$\beta_1$ = slope for magazines
$H_0: \beta_1 = 0$ (Reduced model: Model 2: $\mu\{Sales \mid TV, radio\} = \beta_0 + \beta_2\,TV + \beta_3\,radio$)
$H_a: \beta_1 \neq 0$ (Full model: Model 3: $\mu\{Sales \mid magazines, TV, radio\} = \beta_0 + \beta_1\,magazines + \beta_2\,TV + \beta_3\,radio$)
$df_E(reduced) = n - (k + 1) = 10 - (2 + 1) = 7$
$df_E(full) = n - (k + 1) = 10 - (3 + 1) = 6$
$$F = \frac{[SS_E(reduced) - SS_E(full)] \,/\, [df_E(reduced) - df_E(full)]}{SS_E(full) \,/\, df_E(full)}$$
Conclusion: At the 5% significance level, the data do not provide sufficient evidence to conclude that
magazines have an effect on sales after accounting for the effect of TV and radio advertising.
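The numerical value of the extra-sum-of-squares F-statistic is not shown in this copy. A hedged Python sketch using the Residual rows of the Model 2 and Model 3 ANOVA tables follows; the critical value 5.99 for F(1, 6) at the 5% level is an assumed lookup from a standard F-table, and the variable names are illustrative.

```python
sse_reduced, df_reduced = 193.904, 7   # Model 2: Radio, TV
sse_full, df_full = 117.107, 6         # Model 3: Radio, TV, Magazines

extra_ss = sse_reduced - sse_full
extra_df = df_reduced - df_full

f_statistic = (extra_ss / extra_df) / (sse_full / df_full)

# Assumed table lookup: upper 5% point of F with (1, 6) degrees of freedom.
f_critical = 5.99
reject_h0 = f_statistic > f_critical

print(f"extra SS = {extra_ss:.3f} on {extra_df} df")
print(f"F = {f_statistic:.3f}")
print(f"reject H0 at the 5% level: {reject_h0}")
```

Under these assumptions F is about 3.93, which is below the critical value, so H0 is not rejected. This is consistent with the stated conclusion.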
(a) (2 marks): What is the effect of magazines on sales? How would you redefine the model? No
calculations are necessary.
(b) (1 mark): What would be the null hypothesis for testing for the effect of magazines?
$H_0: \beta_1 = \beta_4 = \beta_5 = \beta_7 = 0$
Question 8 (2 marks): Suppose the relationship between the annual rate of hip fractures (per 100,000
people) and age follows the following model: $\hat{\mu}(\ln(fractures) \mid age) = -2.09 + 0.0912\,age$.
For an increase in age from 40 to 50 years old, what would be your interpretation regarding the rate of hip
fractures on the original scale?
An additive change of (50 – 40) = 10 years in age is associated with a multiplicative change of
$e^{10\hat{\beta}_1} = e^{(10)(0.0912)} = 2.489$ in the median of the annual rate of hip fractures.
In other words, the median rate of hip fractures (per 100,000 people) at 50 years will be 2.489 times the
median rate of hip fractures at 40 years.
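The back-transformation can be checked in Python. This sketch uses only the fitted coefficients from the question; the helper name `median_rate` is an illustrative assumption.

```python
import math

intercept, slope = -2.09, 0.0912   # fitted model on the log scale


def median_rate(age):
    """Back-transform the fitted log rate to the original scale."""
    return math.exp(intercept + slope * age)


ratio = median_rate(50) / median_rate(40)
direct = math.exp(10 * slope)   # the intercept cancels in the ratio

print(f"ratio of medians (50 vs 40) = {ratio:.3f}")
print(f"exp(10 * slope) = {direct:.3f}")
```

Because the intercept cancels, the ratio depends only on the 10-year difference in age, not on the starting age.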
Question 9 (6 marks): All parametric hypothesis tests have assumptions about normality. However, the
specific requirements regarding normality differ from one test to the other. For each of the hypothesis
tests mentioned below, explain what the specific requirement is regarding normality. In answering this
question, do not make reference to the Central Limit Theorem, assuming that sample sizes are not large
enough to apply that theorem.
Solution:
1. Two-sample t-test (independent samples) – The two populations being compared must be normally
distributed.
2. Paired-sample t-test – The differences between paired observations must be normally distributed.
3. One-way ANOVA – All populations being compared must be normally distributed.
4. Two-way ANOVA – For each combination of treatments, the response variable must be normally
distributed.
5. Simple linear regression – For each value of the predictor variable, the corresponding values of the
response variable must be normally distributed.
6. Multiple linear regression – For each set of values of the predictor variables in the model, the
corresponding values of the response variable must be normally distributed.
Question 10 (2 marks): There are two types of inferential statistics that can be applied to a research
problem; one type is a hypothesis test and the other is a confidence interval. What advantage does a
hypothesis test have over calculating a confidence interval? Explain your answer.
Solution:
A hypothesis test provides the strength of the evidence against the null hypothesis, based on a P-value,
which is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as
the one observed. On the other hand, a confidence
interval does not provide the strength of the evidence against the null hypothesis; it can only provide a
decision about whether or not to reject the null hypothesis, at a specific level of confidence.