
M346/J 

Module Examination 2012


Linear Statistical Modelling

Tuesday 16 October 2012 10.00 am – 1.00 pm

Time allowed: 3 hours

This examination is in TWO parts. Part 1 carries 25% of the total available marks
and Part 2 carries 75%.

You should attempt ONE question from Part 1: this question carries 25 marks. You
should attempt THREE questions from Part 2: each question in this part also carries
25 marks.

You are advised not to cross through any work until you have replaced it with another
solution to the same question (or part of question).

In Part 1 of the paper, if you answer both questions, your better score will count
towards your result. In Part 2 of the paper, if you answer more than three questions,
your best three scores will count towards your final mark.

This question paper is rather long because of the inclusion of tranches of GenStat
output. Do not let its length put you off. In your initial reading of the paper,
you will be able to either ignore or pass over very quickly all such output.

Please start each question on a new page, and cross out rough working.

At the end of the examination


Check that you have written your personal identifier and examination number on
each answer book used. (You may well have used only one answer book.) Failure to
do so will mean that your work cannot be identified.

Put all your used answer books together with your signed desk record on top. Fasten
them in the top left corner with the round paper fastener. Attach this question paper
to the back of the answer books with the flat paper clip.

Copyright © 2012 The Open University
PART 1 (Questions 1 and 2)
You should attempt ONE question from this part of the examination,
which carries 25% of the total available marks. Each question carries
25 marks. A guide to mark allocation is shown beside each question
thus: [4].
In each question in Part 1 you are asked to write a short essay on a
topic from the course. By the word ‘essay’, we do not mean to imply
that your answer should be entirely text; formulae and mathematical
symbols, if appropriate, are allowed. However, you should think of
this as an essay question in the senses of structure and readability.
Indeed, 4 of the 25 marks will be awarded for putting the essay
together in a reasonably clear manner, including a reasonable
structure with beginning, middle and conclusion, and reasonably
concise use of language. References to specific data-based examples in
the course are not expected. However, it may be useful to illustrate
points by giving special cases, perhaps in mathematical form (e.g.
Y ∼ N(0, σ²) is a special case of a distributional assumption, and
α + β₁x₁ + β₂x₂ is a special case of a formula for a regression mean).

Question 1
Write an essay in which the role of treatments and blocks in designed
experiments is discussed.
Your answer should include
• a general description of treatments and blocks in the context of
designed experiments, including a description of what they are,
why they are used and how experimental units are allocated to
them; [5]
• a description of how an ANOVA model incorporates the
treatment and block structures; [6]
• a brief description of how the best fitted model is interpreted; [2]
• a brief discussion of how four types of experiment: completely
randomised, randomised block, factorial and latin square,
compare in terms of the number of treatment factors and number
of blocking factors; [4]
• a brief description of confounding, including one advantage and
one disadvantage of having a design that makes use of
confounding. [4]
The remaining four marks are for the clarity and structure of your
essay. [4]

M346 October 2012 2


Question 2
Simple linear regression, logistic regression, and loglinear models are
three useful generalized linear models for studying relationships
between different kinds of explanatory and response data.
Write a short essay in which the links between these three generalized
linear models are considered. In your essay, you should:
• for each of the three models, describe the type of response
variable for which it is appropriate, illustrating each of your
descriptions with an example; [6]
• give the general form of the generalized linear model and explain
how each of the three models fits into this general form; [8]
• explain how the assumptions of the simple linear regression are
checked, and describe the extent to which the same procedures
are useful for logistic regression and loglinear modelling; [5]
• in situations where logistic regression and loglinear modelling are
equally applicable, give one advantage and one disadvantage of
using logistic regression over loglinear modelling. [2]
The remaining four marks are for the clarity and structure of your
essay. [4]



PART 2 (Questions 3 to 7)
You should attempt THREE questions from this part of the
examination, which carries 75% of the total available marks. Each
question carries 25 marks. The mark allocation for each part of each
question is shown beside each part thus: [4].

Question 3
Data on a number of athletes were collected at the Australian
Institute of Sport. For each athlete, two measures were recorded in a
GenStat file: the measured haemoglobin Hg (in grams per decilitre),
and the red blood cell count RCC (×10¹² cells per litre). Interest
centres on whether the response variable, Hg, can be usefully
predicted by the explanatory variable, RCC, using simple linear
regression.
(a) A scatterplot of Hg against RCC is given in Figure 1.

Figure 1
On the basis of this plot, would you say it is reasonable to fit a
simple linear regression model to the data? Briefly explain why or
why not. [3]



(b) The following is the GenStat output from fitting a simple linear
regression model, Model A, to the data for the Australian
athletes.
Model A

Regression analysis
Response variate: Hg
Fitted terms: Constant, RCC
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 1 259.37 259.3713 672.58 <.001
Residual 196 75.58 0.3856
Total 197 334.96 1.7003
Percentage variance accounted for 77.3
Standard error of observations is estimated to be 0.621.
Message: The following units have high leverage.
Unit Response Leverage
161 16.100 0.046
181 18.500 0.032
199 17.700 0.030
Estimates of parameters
Parameter estimate s.e. t(196) t pr.
Constant 2.045 0.483 4.24 <.001
RCC 2.653 0.102 25.93 <.001
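
The columns of a regression summary such as the one above are arithmetically linked, which gives a quick way to check a transcription of the table. The following is an illustrative Python sketch only; it is not part of the examination paper or of GenStat, the variable names are hypothetical, and the inputs are the rounded figures printed above, so agreement is to rounding accuracy only.

```python
import math

# Figures copied from the Model A summary of analysis (rounded as printed).
reg_ms = 259.3713                 # regression mean square
res_ss, res_df = 75.58, 196       # residual sum of squares and d.f.
tot_ss, tot_df = 334.96, 197      # total sum of squares and d.f.

res_ms = res_ss / res_df                      # residual mean square (m.s. column)
tot_ms = tot_ss / tot_df                      # total mean square
variance_ratio = reg_ms / res_ms              # the v.r. column
pct_variance = 100 * (1 - res_ms / tot_ms)    # "Percentage variance accounted for"
se_obs = math.sqrt(res_ms)                    # "Standard error of observations"

# A t statistic is the parameter estimate divided by its standard error.
t_rcc = 2.653 / 0.102

print(round(res_ms, 4), round(variance_ratio, 1), round(pct_variance, 1),
      round(se_obs, 3), round(t_rcc, 2))
```

Small discrepancies from the printed v.r. and t values are expected, because the printed estimates and sums of squares have already been rounded.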

(i) Which two parts of the output given by GenStat indicate that
there is a significant relationship between haemoglobin and
red blood cell count? On the basis of this GenStat output, is
it reasonable to say that an increase in red blood cell count
causes an increase in haemoglobin? Why or why not? [3]
(ii) In the GenStat output above, three units, units 161, 181 and
199, are flagged as having high leverage. Units 161 and 181
have large Cook’s statistics whereas unit 199 does not.
Briefly explain these findings. [3]
(iii) A composite residual plot for Model A is given in Figure 2. Is
there any feature, or features, of this plot that indicates that
the assumptions of the simple linear regression model do not
hold for this model? [3]



Figure 2
(c) In addition to the haemoglobin and red blood cell count, the sex
of each individual athlete was recorded. There was interest in
whether the relationship between haemoglobin and red blood cell
count varied between the two sexes. In the GenStat spreadsheet,
the factor Sex takes the value 0 for male athletes, and takes the
value 1 for female athletes.
In GenStat, two models were fitted for these data:
Model B: constant + RCC + Sex
Model C: constant + RCC + Sex + RCC.Sex
The following output was obtained.
Model B

Regression analysis
Response variate: Hg
Fitted terms: Constant + RCC + Sex
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 2 270.33 135.1632 407.82 <.001
Residual 195 64.63 0.3314
Total 197 334.96 1.7003

Percentage variance accounted for 80.5
Standard error of observations is estimated to be 0.576.
Message: The following units have high leverage.
Unit Response Leverage
69 15.900 0.053
74 15.000 0.054
137 13.500 0.050
153 14.300 0.043
161 16.100 0.055
Estimates of parameters
Parameter estimate s.e. t(195) t pr.
Constant 4.822 0.658 7.32 <.001
RCC 2.132 0.131 16.25 <.001
Sex 1 −0.651 0.113 −5.75 <.001

Model C

Regression analysis
Response variate: Hg
Fitted terms: Constant + RCC + Sex + RCC.Sex
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 3 270.56 90.1859 271.69 <.001
Residual 194 64.40 0.3319
Total 197 334.96 1.7003
Percentage variance accounted for 80.5
Standard error of observations is estimated to be 0.576.
Message: The following units have large standardized residuals.
Unit Response Residual
181 18.500 2.92
Message: The following units have high leverage.
Unit Response Leverage
69 15.900 0.090
74 15.000 0.094
88 14.700 0.066
95 14.500 0.066
113 14.000 0.061
137 13.500 0.094
153 14.300 0.081
161 16.100 0.105
181 18.500 0.063
199 17.700 0.058
Estimates of parameters
Parameter estimate s.e. t(194) t pr.
Constant 5.403 0.958 5.64 <.001
RCC 2.015 0.191 10.54 <.001
Sex 1 −1.69 1.25 −1.35 0.177
RCC.Sex 1 0.220 0.263 0.83 0.405



(i) Write down the two regression equations, one for male
athletes, the other for female athletes, that have been fitted
to data in Model B. Give a simple description of the
difference between the haemoglobin for male and female
athletes as implied by Model B. [4]
(ii) Using Model B calculate the point estimate for the
haemoglobin for a male athlete whose red blood cell count is
4.42 × 10¹² cells per litre. [2]
(iii) Explain carefully which of the three models (Models A, B or
C) best describes the relationship between Hg and RCC. [4]
(iv) Data from how many athletes were used to produce the
output for Model C? Is it possible to deduce the number of
male athletes from the same output? [2]
(v) What assumption is made by Model C that would not be
made if simple linear regression lines were fitted to the data
from male and female athletes separately? [1]



Question 4
Data were collected from schools in a large city on a set of thirty-six
children who were identified as gifted children soon after they reached
the age of four. The analytical skills of the children were evaluated
using a standard testing procedure, and the response variable, score,
is the score on this test. An investigator is interested in
understanding the relationship, if any, between the analytical skills of
young gifted children and the following variables.
fatheriq : Father’s IQ
motheriq : Mother’s IQ
speak : Age in months when the child first said ‘mummy’ or ‘daddy’
count : Age in months when the child first counted to 10 successfully
read : Average number of hours per week the child’s mother or father reads to the child
edutv : Average number of hours per week the child watched an educational program
on TV during the past three months
cartoons : Average number of hours per week the child watched cartoons on TV during
the past three months

(a) Figure 3 is a scatterplot matrix of the data. On the basis of this
plot, it was decided that neither the response variable nor any of
the explanatory variables would be transformed. Explain why
this decision was made. [2]

Figure 3



(b) The following GenStat output gives the correlation matrix of the
7 explanatory variables for this dataset, the results of a multiple
regression analysis of the full 7-variable model (Model A) on the
response variable score, and the results of seven individual simple
linear regressions of score on each of the explanatory variables in
turn. (The correlation matrix and full multiple regression analysis
are given verbatim from GenStat; the individual simple linear
regression results have been edited into a single table.)

Correlations
fatheriq
motheriq −0.0248
speak −0.0305 0.0722
count −0.0750 0.0243 0.0595
read −0.0682 −0.0430 0.1851 0.9103
edutv 0.1162 −0.3300 −0.1545 −0.2157 −0.1666
cartoons −0.2484 0.3384 0.1094 0.1549 0.1257 −0.9234
fatheriq motheriq speak count read edutv cartoons
Model A

Regression analysis
Response variate: score
Fitted terms: Constant + fatheriq + motheriq + speak + count + read
+ edutv + cartoons
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 7 562.4 80.343 11.97 <.001
Residual 28 187.9 6.711
Total 35 750.3 21.437
Percentage variance accounted for 68.7
Standard error of observations is estimated to be 2.59.
Message: The following units have large standardized residuals.
Unit Response Residual
24 169.00 2.36
Message: The following units have high leverage.
Unit Response Leverage
19 160.00 0.56
Estimates of parameters
Parameter estimate s.e. t(28) t pr.
Constant 75.5 24.0 3.14 0.004
fatheriq 0.252 0.138 1.84 0.077
motheriq 0.4001 0.0729 5.49 <.001
speak 0.188 0.148 1.27 0.214
count 0.206 0.266 0.78 0.445
read 7.54 5.59 1.35 0.188
edutv −4.20 2.25 −1.87 0.072
cartoons −3.34 2.02 −1.65 0.109



Summary table of results of individual simple linear
regression on each explanatory variable in turn
Parameter estimate s.e. t(34) t pr.
fatheriq 0.250 0.224 1.12 0.272
motheriq 0.407 0.100 4.06 <.001
speak 0.385 0.237 1.62 0.114
count 0.584 0.154 3.78 <.001
read 11.81 3.28 3.60 0.001
edutv −3.07 1.32 −2.32 0.026
cartoons 1.81 1.23 1.47 0.150
What does this output suggest about which explanatory variables
are likely to be included in a good multiple regression model
based on a subset of the seven explanatory variables? Explain
your answer clearly, making explicit which parts of the output
relate to each of your conclusions. [7]
(c) The stepwise regression method provided by GenStat, with M346
default choices, was applied to the data set. Starting from the full
model, Model B was arrived at, as follows.

Model B

Regression analysis
Response variate: score
Fitted terms: Constant + motheriq + read + edutv + cartoons
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 4 530.4 132.593 18.69 <.001
Residual 31 219.9 7.095
Total 35 750.3 21.437
Change 3 32.0 10.677 1.59 0.214
Percentage variance accounted for 66.9
Standard error of observations is estimated to be 2.66.
Message: The following units have high leverage.
Unit Response Leverage
4 157.00 0.36
33 151.00 0.31
Estimates of parameters
Parameter estimate s.e. t(31) t pr.
Constant 112.5 14.4 7.83 <.001
motheriq 0.4177 0.0740 5.64 <.001
read 11.60 2.24 5.19 <.001
edutv −6.05 2.12 −2.85 0.008
cartoons −5.11 1.88 −2.72 0.011
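
The Change line in the Model B summary compares the reduced model with the full model (Model A). The following Python sketch is an illustration only, with hypothetical variable names and the rounded figures printed in the two outputs. That the variance ratio is formed against the residual mean square of the full model is inferred here from the printed numbers; it is not stated in the paper.

```python
# Figures copied from the printed output (rounded as printed).
res_ss_a, res_df_a = 187.9, 28   # Model A (full model) residual s.s. and d.f.
res_ss_b, res_df_b = 219.9, 31   # Model B (reduced model) residual s.s. and d.f.

change_df = res_df_b - res_df_a          # number of terms dropped from Model A
change_ss = res_ss_b - res_ss_a          # extra residual s.s. caused by dropping them
change_ms = change_ss / change_df        # mean square for the change
res_ms_a = res_ss_a / res_df_a           # residual mean square of the full model
change_vr = change_ms / res_ms_a         # variance ratio reported on the Change line

print(change_df, round(change_ss, 1), round(change_ms, 2), round(change_vr, 2))
```

The recomputed mean square differs slightly from the printed 10.677 because the sums of squares are shown to one decimal place only.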



(i) Briefly describe in general the process by which GenStat arrives
at a parsimonious model using stepwise regression. [2]
(ii) Does Model B seem reasonable given the preliminary analysis you
did in part (b)? Why or why not? [3]
(iii) Write down the regression equation for Model B and explain in
qualitative terms what the model says about the dependence of
analytical skills on the explanatory variables. [3]
(iv) Estimate the expected score for a gifted child, whose father’s IQ
is 125, mother’s IQ is 129, first said ‘mummy’ or ‘daddy’ at
18 months, first counted to 10 successfully at 32 months, mother
or father reads to them on average 1.8 hours per week, watches
an average 1.5 hours a week of educational TV and watches an
average 3.75 hours a week of cartoons. [2]
(v) For Model B, unit 19 has relatively low leverage. Explain briefly
why this unit can be flagged as having high leverage in the full
model (Model A) but not in Model B, the model given by
stepwise regression. [3]
(vi) Model B has a smaller percentage of variance explained than has
the full model (Model A). Explain briefly why it may well still be
preferable to use Model B rather than Model A. [3]



Question 5
(a) An experiment was carried out to explore the possible effect of
different types of pollen on the diameter of Cox’s Orange Pippin
apples. In the experiment, the pollen from four different varieties
of apple (factor pollen) 1 – King of the Pippins, 2 – Ellison’s
Orange, 3 – James Grieve and 4 – Worcester Pearmain were used
to pollinate blossom on Cox’s Orange Pippin apple trees.
The experiment required blossom to be hand-pollinated, a
time-consuming process. So, to reduce workload, blossom on just
eight Cox’s Orange Pippin trees was pollinated. The blossom on
each tree was split into three groups and pollen from a different
apple variety applied to each group. The mean diameter, in
millimetres (variate diameter), of the resulting apples in each group
was then recorded. Table 1 gives the data from this experiment.
Table 1
Pollen variety (pollen)
Tree (tree) 1 2 3 4
1 — 18.71 17.01 17.23
2 20.42 — 18.02 19.59
3 21.39 20.85 — 18.94
4 19.49 18.48 17.22 —
5 — 23.16 23.33 23.34
6 24.67 — 24.54 25.00
7 24.14 21.78 — 20.86
8 26.70 25.75 24.77 —
(i) The design of this experiment is what is known as a balanced
incomplete block design. In what sense is the design
‘balanced’ and in what sense is the design ‘incomplete’ ?
Suggest a reason why the investigators decided to regard each
Cox’s Orange Pippin tree as a block. [4]
(ii) How would you lay out the data from this experiment in a
GenStat spreadsheet? Give the number of rows and columns
and describe the information each column would contain. [4]



(iii) General analysis of variance was used in GenStat to analyse
the data from this experiment, with the block structure given
as tree and the treatment structure as pollen. The resulting
analysis of variance table was as follows.

Analysis of variance
Variate: diameter

Source of variation d.f. s.s. m.s. v.r. F pr.

tree stratum
pollen 3 10.3004 3.4335 0.08 0.969
Residual 4 177.2346 44.3087 85.53

tree.*Units* stratum
pollen 3 11.5671 3.8557 7.44 0.004
Residual 13 6.7350 0.5181

Total 23 205.8372
Conduct a formal statistical test to investigate whether the
diameter of apples does depend on the type of pollen. (You
need to give details of the test statistic, the p value for the
test, and the degrees of freedom of the distribution with
which the test statistic is to be compared.) [4]
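
The analysis of variance table above has two strata, and each pollen line is compared with the residual of its own stratum. The following Python sketch is an illustration only (the dictionary keys and variable names are hypothetical). It uses the rounded figures printed in the table and recomputes the mean squares, the totals and the two variance ratios as a consistency check.

```python
# Figures copied from the analysis of variance table (rounded as printed):
# each entry is (degrees of freedom, sum of squares).
rows = {
    "tree: pollen":         (3, 10.3004),
    "tree: residual":       (4, 177.2346),
    "tree.units: pollen":   (3, 11.5671),
    "tree.units: residual": (13, 6.7350),
}
ms = {name: ss / df for name, (df, ss) in rows.items()}   # mean squares

total_df = sum(df for df, _ in rows.values())
total_ss = sum(ss for _, ss in rows.values())

# Each treatment line is divided by the residual mean square of its own stratum.
vr_between_trees = ms["tree: pollen"] / ms["tree: residual"]
vr_within_trees = ms["tree.units: pollen"] / ms["tree.units: residual"]

print(total_df, round(total_ss, 4),
      round(vr_between_trees, 2), round(vr_within_trees, 2))
```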



(iv) Figure 4 is the corresponding composite residual plot.

Figure 4
What does Figure 4 tell you about the appropriateness of the
model that has been fitted? Justify your answer. [4]
(v) How could the same model be fitted without using ANOVA? [2]
(b) In an experiment, the effect of the following three factors on the
hardness of dental fillings (variate hardness) was explored.
gold : the type of gold used (eight levels)
dentist : the dentist making up the filling (five levels)
condense : the condensation method used (three levels)
Every possible treatment combination was used, and there was
one replication per treatment combination. Two models were
then fitted to the data.
Model A : dentist + condense + gold + dentist.condense
+ dentist.gold + condense.gold
Model B : gold + dentist*condense.
(i) In Model A what assumption is made about the three-way
interaction? Why is this assumption necessary in order to
carry out inference? [2]



(ii) The analysis of variance table associated with Model A is as
follows.

Analysis of variance
Variate: hardness

Source of variation d.f. s.s. m.s. v.r. F pr.


dentist 4 218313. 54578. 5.46 <.001
condense 2 596611. 298305. 29.87 <.001
gold 7 220578. 31511. 3.15 0.007
dentist.condense 8 262741. 32843. 3.29 0.004
dentist.gold 28 209437. 7480. 0.75 0.796
condense.gold 14 209849. 14989. 1.50 0.141
Residual 56 559319. 9988.
Total 119 2276847.
Based on this output, explain carefully why Model B can be
regarded as the simplest adequate model for these data. [3]
(iii) Is it plausible that Figure 5 is a plot of means arising from
Model B? Justify your answer. [2]

Figure 5



Question 6
In a 1970s study of 44 USA coastal rivers, the number of different
species of freshwater mussels found in each river was recorded
(GenStat variate species). Also recorded for each river were the
following four explanatory variables.
area : Area of the drainage basin, in square miles
nitrate : Nitrate concentration, in parts per million (ppm)
solids : The solid residue left after evaporation, in parts per million (ppm)
pH : The pH of the water

Based on scatterplots, it was decided to work with larea = log(area),
lnitrate = log(nitrate) and lsolids = log(solids).
(a) Initially two approaches to modelling the data were considered:
• a multiple regression model with lspecies = log(species) as the
response variate;
• a Poisson regression with species as the response variate and
using a log link.
(i) Give one way in which the two approaches differ. [2]
(ii) How is the dependency of the mean number of species on the
explanatory variables modelled in the Poisson regression
approach? [2]
(iii) Why is it not surprising that a log link was considered for the
Poisson regression? [1]
(iv) Which of the four plots in a composite residual plot
(histogram, normal, half-normal and fitted value) would be
useful in judging whether a Poisson regression model fits
adequately? For the useful plots, what would you expect to
see if the model does indeed fit adequately? [4]
(b) It was decided to pursue modelling species using Poisson
regression. In GenStat the following model (Model A) was fitted
to the data. The resulting output is given below.
Model A

Regression analysis
Response variate: species
Distribution: Poisson
Link function: Log
Fitted terms: Constant + larea + lnitrate + lsolids + pH
Summary of analysis
mean deviance approx
Source d.f. deviance deviance ratio chi pr
Regression 4 61.34 15.335 15.34 <.001
Residual 39 68.36 1.753
Total 43 129.70 3.016
Dispersion parameter is fixed at 1.00.
Message: Deviance ratios are based on dispersion parameter with
value 1.

Message: The following units have large standardized residuals.
Unit Response Residual
6 8.00 2.40
29 2.00 −3.39
40 20.00 2.78
41 33.00 2.94
44 23.00 2.88
Message: The following units have high leverage.
Unit Response Leverage
31 10.00 0.28
Estimates of parameters
antilog of
Parameter estimate s.e. t(∗) t pr. estimate
Constant 3.22 1.55 2.09 0.037 25.09
larea 0.3012 0.0507 5.95 <.001 1.352
lnitrate −0.0374 0.0542 −0.69 0.490 0.9633
lsolids −0.229 0.106 −2.16 0.031 0.7949
pH −0.322 0.143 −2.26 0.024 0.7247
Message: s.e.s are based on dispersion parameter with value 1.
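
Two columns of the Poisson regression output above can be recomputed from the others: the mean deviance is the deviance divided by its degrees of freedom, and the "antilog of estimate" is the exponential of the estimate. The following Python sketch is an illustration only, with hypothetical names and the rounded figures printed above, so the antilogs agree with the printed values only approximately.

```python
import math

# Figures copied from the Model A output (rounded as printed):
# each entry is (degrees of freedom, deviance).
deviance = {"Regression": (4, 61.34), "Residual": (39, 68.36), "Total": (43, 129.70)}
mean_deviance = {src: dev / df for src, (df, dev) in deviance.items()}

estimates = {"Constant": 3.22, "larea": 0.3012, "lnitrate": -0.0374,
             "lsolids": -0.229, "pH": -0.322}
# With a log link, exp(estimate) is the factor by which the fitted mean is
# multiplied for a one-unit increase in that term (other terms held fixed).
antilog = {name: math.exp(b) for name, b in estimates.items()}

for src, value in mean_deviance.items():
    print(src, round(value, 3))
for name, value in antilog.items():
    print(name, round(value, 4))
```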
(i) In the output for Model A, GenStat reports an approx chi
pr. of < 0.001. What was the value of the corresponding test
statistic and how many degrees of freedom does the χ²
distribution referred to have? What does this probability tell
us about the parameters of the model? [4]
(ii) Use the output to estimate the number of freshwater mussel
species that would be found in a river with a drainage basin
of 2050 square miles, 2.1 ppm nitrates, 140 ppm solids and a
pH of 7. [4]
(iii) Does it appear that the model could be simplified? Justify
your answer. Why is your answer necessarily only
approximate? [3]
(c) Another Poisson regression model, Model B, was also fitted to the
data, by dropping lnitrate, lsolids and pH from Model A. The
resulting output produced by GenStat is given below. For these
data is Model B better than Model A? Justify why or why not. [2]



Model B

Regression analysis
Response variate: species
Distribution: Poisson
Link function: Log
Fitted terms: Constant + larea
Summary of analysis
mean deviance approx
Source d.f. deviance deviance ratio chi pr
Regression 1 51.88 51.881 51.88 <.001
Residual 42 77.82 1.853
Total 43 129.70 3.016
Change 3 9.46 3.153 3.15 0.024
Dispersion parameter is fixed at 1.00.
Message: Deviance ratios are based on dispersion parameter with
value 1.
Message: The following units have large standardized residuals.
Unit Response Residual
12 11.00 −2.39
29 2.00 −3.26
40 20.00 2.89
41 33.00 3.34
44 23.00 3.20
Message: The following units have high leverage.
Unit Response Leverage
12 11.00 0.145
Estimates of parameters
antilog of
Parameter estimate s.e. t(∗) t pr. estimate
Constant −0.292 0.402 −0.73 0.468 0.7470
larea 0.3214 0.0461 6.97 <.001 1.379
Message: s.e.s are based on dispersion parameter with value 1.
(d) (i) Does overdispersion appear to be a problem in Model B?
Justify your answer. [2]
(ii) Regardless of your answer to part (d)(i), suppose it is decided
that Model B is overdispersed. Suggest a way of changing
Model B to deal with the overdispersion. [1]



Question 7
The data in this question relate to a 1970s investigation into
satisfaction with housing in Copenhagen, Denmark. Residents of
selected areas living in rented homes built between 1960 and 1968
were questioned. For each respondent the following were recorded.
housing : the type of housing they had (tower blocks, apartments, atrium houses
and terraced houses)
influenc : their feeling of influence on apartment management (low, medium, high)
contact : their degree of contact with neighbors (low, high)
satisfac : their satisfaction with housing conditions (low, medium, high)
It was decided to analyze the data using loglinear analysis in GenStat.
(a) Describe how these data should be entered into a GenStat
spreadsheet. Your answer should include the number of rows and
columns required along with a description of what should be in
each column. [4]
(b) Which terms, if any, must be included in any sensible loglinear
model as a result of the way the study was designed? Justify your
answer. [2]
(c) For these data, what terms would be included in the saturated
model? What would be the residual deviance and corresponding
residual degrees of freedom? [2]
Using GenStat, a loglinear model including all four main effects and
all two-way interactions was fitted (Model A). The resulting output is
given below together with some potentially useful χ² probabilities.
Model A

Regression analysis
Response variate: counts
Distribution: Poisson
Link function: Log
Fitted terms: Constant + housing + influenc + contact + satisfac
+ housing.influenc + housing.contact + influenc.contact
+ housing.satisfac + influenc.satisfac + contact.satisfac
Summary of analysis
mean deviance approx
Source d.f. deviance deviance ratio chi pr
Regression 31 789.71 25.474 25.47 <.001
Residual 40 43.95 1.099
Total 71 833.66 11.742
Dispersion parameter is fixed at 1.00.
Message: Deviance ratios are based on dispersion parameter with value 1.



Chi-square Cumulative Upper Probabilities
31 degrees of freedom and X deviate of 789.71: 0
31 degrees of freedom and X deviate of 25.474: 0.7461
40 degrees of freedom and X deviate of 43.95: 0.3079
40 degrees of freedom and X deviate of 1.099: 1.000
71 degrees of freedom and X deviate of 833.66: 0
71 degrees of freedom and X deviate of 11.742: 1.000
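
The degrees of freedom in the Model A summary can be recovered by counting parameters from the numbers of levels of the four factors. The following Python sketch is an illustration only, with hypothetical names. It assumes the usual rule that a factor with k levels contributes k − 1 degrees of freedom and that an interaction contributes the product of the degrees of freedom of its factors.

```python
from itertools import combinations
from math import prod

# Number of levels of each factor, as listed in the question.
levels = {"housing": 4, "influenc": 3, "contact": 2, "satisfac": 3}

cells = prod(levels.values())                      # one count per factor combination
main_df = {f: k - 1 for f, k in levels.items()}    # d.f. for each main effect
two_way_df = {
    f"{a}.{b}": main_df[a] * main_df[b]            # d.f. for each two-way interaction
    for a, b in combinations(levels, 2)
}

regression_df = sum(main_df.values()) + sum(two_way_df.values())
total_df = cells - 1
residual_df = total_df - regression_df

print(cells, regression_df, residual_df, total_df)
```

The counts obtained this way agree with the 31, 40 and 71 degrees of freedom printed in the Summary of analysis for Model A.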
(d) Why is it reasonable to assert that Model A adequately fits the
data? Your answer should include the value of the relevant test
statistic and the distribution to which this test statistic is
compared as well as the p value. [4]
(e) Six models, all simplifications of Model A, were also fitted using
GenStat. Each simpler model included all the main effects and all
but one of the two-factor interactions. Table 3 gives, for each
simpler model fitted, the change in deviance compared with
Model A, the corresponding change in d.f.s and the associated
p value.
Table 3

Interaction dropped Change in deviance d.f. p value


housing.influenc 13.69 6 0.033
housing.contact 44.04 3 <0.001
influenc.contact 23.71 2 <0.001
housing.satisfac 62.20 6 <0.001
influenc.satisfac 109.10 4 <0.001
contact.satisfac 16.02 2 <0.001

(i) Why is the degrees of freedom associated with dropping the
housing.satisfac term equal to 6? [1]
(ii) Which, if any, of the two-factor interactions listed in Table 3
could be dropped from Model A? Justify your answer. [2]
(iii) Using Table 3, complete the following Summary of analysis
table for Model B, the model resulting when the two-factor
interaction housing.satisfac is omitted from Model A. [7]
Model B

Regression analysis
Response variate: counts
Distribution: Poisson
Link function: Log
Fitted terms: Constant + contact + housing + influenc + satisfac
+ contact.housing + contact.influenc + contact.satisfac
+ influenc.satisfac + housing.influenc

Summary of analysis
mean deviance approx
Source d.f. deviance deviance ratio chi pr
Regression ∗∗ ∗∗∗∗∗∗ ∗∗∗∗∗∗ ∗∗∗∗∗ <.001
Residual ∗∗ ∗∗∗∗∗∗ ∗∗∗∗∗∗
Total ∗∗ ∗∗∗∗∗∗ ∗∗∗∗∗∗
Dispersion parameter is fixed at 1.00.
Message: Deviance ratios are based on dispersion parameter
with value 1.
(iv) Briefly describe the table or tables you would use to best
summarise the relationship between housing and satisfac that
is implied by Model B. [1]
(f) Write down the loglinear model that would exactly match a
logistic regression model which had contact as the response and
had terms satisfac + housing + satisfac.housing. [2]
[END OF QUESTION PAPER]
