Module Code: MATH371401
Module Title: Linear Regression and Robustness ©UNIVERSITY OF LEEDS
School of Mathematics Semester One 2022/23
Calculator instructions:
• You are allowed to use a non-programmable calculator in this examination.
Dictionary instructions:
• You are not allowed to use your own dictionary in this exam. A basic English dictionary
is available to use. Raise your hand and ask an invigilator if you need it.
Exam information:
• There are 9 pages to this examination.
• You have 2 hours 30 minutes to complete this examination.
• This examination is worth 80% of the module mark.
• There are four questions in this exam paper. You must answer all four questions.
• All questions are worth 20 marks.
• The numbers in brackets indicate the marks available for each part of each question.
• Statistical tables are attached.
• You must show all your calculations.
• You must write your answers in the answer booklet provided. If you require an additional
answer booklet, raise your hand so an invigilator can provide one.
• You must clearly state your name and Student ID Number in the relevant boxes on your
answer booklet. Other boxes may be left blank.
Page 1 of 9 Turn the page over
1. The linear regression model in matrix notation is given by
y = Xβ + ε
where y ∈ R^n, β ∈ R^(p+1), X ∈ R^(n×(p+1)) and ε ∼ N(0, σ²I).
(a) i. Explain the meaning of n, p, y, β, ε and σ² in a linear regression problem. [3]
ii. Explain how the design matrix X is constructed. [2]
(b) Consider the estimator β̂ = (X ⊤ X)−1 X ⊤ y for β.
i. Which quantity is minimised by β̂? [2]
ii. Show that E(β̂) = β and give an interpretation of this result. [3]
iii. Derive a formula for Cov(β̂). [3]
(c) Consider the following R summary output for a fitted linear regression model: [3]
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-2.7191 -0.6446 0.0393 0.6623 2.8102
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.78356 0.21076 51.165 <2e-16 ***
x1 0.04560 0.33819 0.135 0.893
x2 0.14167 0.09130 1.552 0.124
x3 -0.09134 0.10183 -0.897 0.372
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9525 on 96 degrees of freedom
Multiple R-squared: 0.03006, Adjusted R-squared: -0.0002527
F-statistic: 0.9917 on 3 and 96 DF, p-value: 0.4002
What problem with the model does this output indicate?
(d) Assume that the assumption E(εi) = 0 is violated and that instead we have [4]
E(εi) = 1 for i ∈ {1, . . . , n}. Give an intuitive explanation of how this would
affect the estimate β̂.
2. (a) Give the definition of the t-distribution with ν degrees of freedom, as we used it in [3]
this module.
(b) Assume that X ∼ N(0, 1), Y ∼ N(0, 1) and Z ∼ χ²(4) are independent. What [5]
are the distributions of the following random variables?
i. X + Y
ii. X² + Y²
iii. X²/Y²
iv. X² + Y² + Z
v. 2Y/√Z
(c) Consider the regression model described by the following R output: [4]
Call:
lm(formula = y ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-1.3356 -0.4612 -0.1078 0.5209 1.6900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.48360 0.46486 3.191 0.00568 **
x1 2.04917 0.08134 25.192 2.66e-14 ***
x2 -3.20778 0.16827 -19.063 2.00e-12 ***
x3 4.06104 0.15864 25.599 2.07e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8378 on 16 degrees of freedom
Multiple R-squared: 0.9863, Adjusted R-squared: 0.9838
F-statistic: 385.2 on 3 and 16 DF, p-value: 4.017e-15
Find a 95% confidence interval for β2 . (Statistical tables are attached after the
end of questions.)
(d) Consider a linear regression model y = Xβ + ε with ε ∼ N(0, σ²I) with p = 3
explanatory variables. Assume we want to test the hypothesis H0 : β1 = β2 against
the alternative H1 : β1 ≠ β2.
i. State the test statistic for this test. [2]
ii. Assuming H0 is true, what is the distribution of the test statistic? [2]
(e) Using the fact that [4]
(n − p − 1)σ̂² / σ² ∼ χ²(n − p − 1),
construct a 95% confidence interval for σ².
3. (a) Consider the coefficient of multiple determination R²:
i. State the formula for R² in terms of the sample variances s²ε̂ and s²y. [1]
ii. Briefly describe the role of R² as a diagnostic aid in linear regression and give [2]
an interpretation of the formula from part i.
(b) i. When fitting a linear model to data, give two reasons why transforming the [2]
data may be desirable.
ii. With the help of sketches, briefly describe two features of a residual plot which [3]
would indicate that a transformation of the data might be desirable.
(c) In R, three different models are fitted using the following commands. [7]
m1 <- lm(y ~ x1 + x2)
m2 <- lm(sqrt(y) ~ sqrt(x1) + sqrt(x2))
m3 <- lm(log(y) ~ log(x1) + log(x2))
Residual plots and Q-Q plots for the three models are shown below:
[Figure: six panels. Left column: residual plots — resid(m1) against fitted(m1), resid(m2) against fitted(m2), and resid(m3) against fitted(m3). Right column: the corresponding normal Q-Q plots (Sample Quantiles against Theoretical Quantiles).]
Discuss the relevant features visible in the six panels. Which model do you prefer?
(d) For each of the three models, what is y as a function of x1, x2 and ε? [5]
4. (a) Consider Cook’s distance Di .
i. What does it tell us about the data, if Di > 1? [2]
ii. Introducing any notation you use, explain how Di is computed. [3]
(b) Introducing any notation you use, give the definition of an M-estimate for the [3]
regression coefficients β.
(c) Under which circumstances are M-estimates more robust than the least squares [3]
regression estimate? Justify your answer.
(d) Define Huber’s t-function and show that the function is convex. Why is this a [3]
desirable property?
(e) In a few sentences, explain what the breakdown point of a regression estimate is. [2]
(f) Give a proof of the fact that the breakdown point of the least squares regression [4]
estimate is 0%.
End of questions
Normal Distribution Function Tables
The first table gives
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^(−t²/2) dt,
which corresponds to the shaded area in the figure. [Figure: standard normal density with the area Φ(x) to the left of x shaded.] Φ(x) is the probability that a random variable, normally distributed with zero mean and unit variance, is less than or equal to x. When x < 0 use Φ(x) = 1 − Φ(−x), since the normal distribution with mean zero is symmetric about zero. To interpolate, use the formula
Φ(x) ≈ Φ(x1) + ((x − x1)/(x2 − x1)) (Φ(x2) − Φ(x1)).
Table 1
x Φ(x) x Φ(x) x Φ(x) x Φ(x) x Φ(x) x Φ(x)
0.00 0.5000 0.50 0.6915 1.00 0.8413 1.50 0.9332 2.00 0.9772 2.50 0.9938
0.05 0.5199 0.55 0.7088 1.05 0.8531 1.55 0.9394 2.05 0.9798 2.55 0.9946
0.10 0.5398 0.60 0.7257 1.10 0.8643 1.60 0.9452 2.10 0.9821 2.60 0.9953
0.15 0.5596 0.65 0.7422 1.15 0.8749 1.65 0.9505 2.15 0.9842 2.65 0.9960
0.20 0.5793 0.70 0.7580 1.20 0.8849 1.70 0.9554 2.20 0.9861 2.70 0.9965
0.25 0.5987 0.75 0.7734 1.25 0.8944 1.75 0.9599 2.25 0.9878 2.75 0.9970
0.30 0.6179 0.80 0.7881 1.30 0.9032 1.80 0.9641 2.30 0.9893 2.80 0.9974
0.35 0.6368 0.85 0.8023 1.35 0.9115 1.85 0.9678 2.35 0.9906 2.85 0.9978
0.40 0.6554 0.90 0.8159 1.40 0.9192 1.90 0.9713 2.40 0.9918 2.90 0.9981
0.45 0.6736 0.95 0.8289 1.45 0.9265 1.95 0.9744 2.45 0.9929 2.95 0.9984
0.50 0.6915 1.00 0.8413 1.50 0.9332 2.00 0.9772 2.50 0.9938 3.00 0.9987
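As a worked illustration of the interpolation formula above (not part of the original paper; a minimal Python sketch, with the two anchor values copied from Table 1):

```python
def interp_phi(x, x1, phi1, x2, phi2):
    # Linear interpolation between two tabulated points (x1, phi1) and (x2, phi2):
    # Phi(x) ~ Phi(x1) + (x - x1)/(x2 - x1) * (Phi(x2) - Phi(x1))
    return phi1 + (x - x1) / (x2 - x1) * (phi2 - phi1)

# Example: Phi(1.23) from the tabulated values Phi(1.20) = 0.8849, Phi(1.25) = 0.8944.
print(round(interp_phi(1.23, 1.20, 0.8849, 1.25, 0.8944), 4))  # 0.8906
```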
Table 2. The inverse function Φ⁻¹(p) is tabulated below for various values of p.
p 0.900 0.950 0.975 0.990 0.995 0.999 0.9995
Φ⁻¹(p) 1.2816 1.6449 1.9600 2.3263 2.5758 3.0902 3.2905
Quantiles of the t-Distribution
This table gives the α-quantiles qα of the t(ν)-distribution, defined by P(T ≤ qα) = α, for various degrees of freedom ν and various values of α. [Figure: t density with the area α to the left of qα shaded.] Quantiles for the lower tail can be found using the symmetry P(T ≤ −q) = P(T ≥ q) = 1 − P(T ≤ q). The limiting distribution of t(ν) as ν → ∞ is the standard normal distribution.
ν α = 0.9 α = 0.95 α = 0.975 α = 0.99 α = 0.995
1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106
12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
16 1.337 1.746 2.120 2.583 2.921
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861
20 1.325 1.725 2.086 2.528 2.845
25 1.316 1.708 2.060 2.485 2.787
30 1.310 1.697 2.042 2.457 2.750
35 1.306 1.690 2.030 2.438 2.724
40 1.303 1.684 2.021 2.423 2.704
45 1.301 1.679 2.014 2.412 2.690
50 1.299 1.676 2.009 2.403 2.678
60 1.296 1.671 2.000 2.390 2.660
80 1.292 1.664 1.990 2.374 2.639
100 1.290 1.660 1.984 2.364 2.626
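The lower-tail symmetry rule stated above can be sketched in code (not part of the original paper; a Python sketch using two illustrative values copied from the table):

```python
# Upper-tail quantiles copied from the table above; keys are (nu, alpha).
upper_quantile = {(20, 0.975): 2.086, (16, 0.975): 2.120}

def t_quantile(nu, alpha):
    # For T ~ t(nu), symmetry gives P(T <= -q) = 1 - P(T <= q), so the
    # alpha-quantile for alpha < 0.5 is minus the (1 - alpha)-quantile.
    if alpha >= 0.5:
        return upper_quantile[(nu, alpha)]
    return -upper_quantile[(nu, round(1 - alpha, 3))]

print(t_quantile(20, 0.025))  # -2.086
```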
Quantiles of the χ²-Distribution
This table gives the α-quantiles qα of the χ²(ν)-distribution, defined by P(Y ≤ qα) = α, for various degrees of freedom ν and various values of α. [Figure: χ²(3) density with the area α to the left of qα shaded.] If Y ∼ χ²(ν) for ν > 100, then √(2Y) is approximately normally distributed with mean √(2ν − 1) and unit variance.
ν α = 0.025 α = 0.9 α = 0.95 α = 0.975 α = 0.99 α = 0.995
1 0.001 2.706 3.841 5.024 6.635 7.879
2 0.051 4.605 5.991 7.378 9.210 10.597
3 0.216 6.251 7.815 9.348 11.345 12.838
4 0.484 7.779 9.488 11.143 13.277 14.860
5 0.831 9.236 11.070 12.833 15.086 16.750
6 1.237 10.645 12.592 14.449 16.812 18.548
7 1.690 12.017 14.067 16.013 18.475 20.278
8 2.180 13.362 15.507 17.535 20.090 21.955
9 2.700 14.684 16.919 19.023 21.666 23.589
10 3.247 15.987 18.307 20.483 23.209 25.188
11 3.816 17.275 19.675 21.920 24.725 26.757
12 4.404 18.549 21.026 23.337 26.217 28.300
13 5.009 19.812 22.362 24.736 27.688 29.819
14 5.629 21.064 23.685 26.119 29.141 31.319
15 6.262 22.307 24.996 27.488 30.578 32.801
16 6.908 23.542 26.296 28.845 32.000 34.267
17 7.564 24.769 27.587 30.191 33.409 35.718
18 8.231 25.989 28.869 31.526 34.805 37.156
19 8.907 27.204 30.144 32.852 36.191 38.582
20 9.591 28.412 31.410 34.170 37.566 39.997
25 13.120 34.382 37.652 40.646 44.314 46.928
30 16.791 40.256 43.773 46.979 50.892 53.672
35 20.569 46.059 49.802 53.203 57.342 60.275
40 24.433 51.805 55.758 59.342 63.691 66.766
45 28.366 57.505 61.656 65.410 69.957 73.166
50 32.357 63.167 67.505 71.420 76.154 79.490
60 40.482 74.397 79.082 83.298 88.379 91.952
80 57.153 96.578 101.879 106.629 112.329 116.321
100 74.222 118.498 124.342 129.561 135.807 140.169
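The normal approximation stated above can be checked numerically at the table's last row, ν = 100 (not part of the original paper; a minimal Python sketch, with z_0.95 = 1.6449 taken from Table 2):

```python
import math

def chi2_quantile_approx(nu, z_alpha):
    # If Y ~ chi^2(nu), sqrt(2Y) is approximately N(sqrt(2*nu - 1), 1), so the
    # alpha-quantile of chi^2(nu) is roughly (z_alpha + sqrt(2*nu - 1))**2 / 2.
    return (z_alpha + math.sqrt(2 * nu - 1)) ** 2 / 2

# Example: 0.95-quantile for nu = 100; the tabulated value above is 124.342.
print(round(chi2_quantile_approx(100, 1.6449), 1))  # 124.1
```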
End of paper