
Ashley Pancholi

Statistical data analysis project MA2206

Question 1

Part i)

EBCT Model 1 is a model which incorporates Bayesian thinking and can be seen as a generalisation of the normal/normal model. Unlike in standard Bayesian credibility, our aim with EBCT Model 1 is to estimate not θ itself but some function of θ, namely m(θ). When using EBCT Model 1 we need to define the following notation:

$X_{ij}$ = the aggregate claims in the $j$th year from the $i$th risk

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$$

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} \bar{X}_i$$

Here are the parameter estimators I have used in my calculations:

Quantity                    Estimator
$E[m(\theta)]$              $\bar{X}$
$E[s^2(\theta)]$            $\frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$
$\mathrm{Var}[m(\theta)]$   $\frac{1}{N-1}\sum_{i=1}^{N}(\bar{X}_i-\bar{X})^2-\frac{1}{Nn}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$

Credibility formula:

$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}}$$

When using EBCT Model 1 to model the number of new businesses/companies quarter by quarter for the various regions, $X_{ij}$ represents the total number of new companies in region $i$ in quarter $j$. Here $j$ indexes the time units we are working in, which are quarters, and $i$ denotes the different regions. In this example $n$ denotes the number of time units of available data and $N$ denotes the number of risks (regions); hence $n = 8$ and $N = 9$.

From performing the relevant intermediate calculations under the assumption of EBCT model 1 we can now determine our
credibility factor Z given by
$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}} = \frac{8}{8 + \dfrac{5225846.28}{43688833.1}} = 0.985268$$

After determining that our credibility factor Z is approximately 0.9853, we can now perform estimates for the number of
new business start-ups we would expect in the regions for the following quarter i.e. Q1 for 2020.

The estimation formula we use is:

$$Z\bar{X}_i + (1-Z)\mu$$

Here $Z$ is the credibility factor we have already defined, $\mu$ is our prior mean, i.e. $E[m(\theta)]$, and $\bar{X}_i$ is the mean of our sample data. We now use this formula to determine the expected number of new company start-ups we would expect under EBCT Model 1 next quarter for all the regions.

Region of interest          Credibility factor Z    Projected number of company start-ups in Q1 2020
North East                  0.9853                  2459
North West                  0.9853                  10386
Yorkshire and the Humber    0.9853                  6070
East Midlands               0.9853                  6036
West Midlands               0.9853                  8590
East                        0.9853                  8983
London                      0.9853                  25015
South East                  0.9853                  13840
South West                  0.9853                  6312

From performing the estimate calculations we can see that the projected figures for Q1 2020 are reasonably close to the corresponding sample mean for each region. This is because the credibility factor we calculated was very high, at 0.9853: we essentially place more confidence and relevance in the risk data, so the estimates we have calculated are weighted averages tilted heavily towards the risk data, with much less relevance placed on the collateral data.
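As a cross-check on the spreadsheet workings, here is a minimal Python sketch of the EBCT Model 1 calculations described above. The array `X` of quarterly start-up counts (regions as rows, quarters as columns) is a hypothetical stand-in for the data used in this project, so the printed numbers are illustrative only.

```python
import numpy as np

def ebct_model_1(X):
    """EBCT Model 1 estimators for a data matrix X of shape (N risks, n time units)."""
    N, n = X.shape
    Xi_bar = X.mean(axis=1)                        # per-risk means
    X_bar = Xi_bar.mean()                          # overall mean, estimates E[m(theta)]
    # E[s^2(theta)]: average of the within-risk sample variances
    E_s2 = np.mean([((X[i] - Xi_bar[i]) ** 2).sum() / (n - 1) for i in range(N)])
    # Var[m(theta)]: between-risk variance minus the correction term E[s^2(theta)] / n
    Var_m = ((Xi_bar - X_bar) ** 2).sum() / (N - 1) - E_s2 / n
    Z = n / (n + E_s2 / Var_m)                     # credibility factor
    estimates = Z * Xi_bar + (1 - Z) * X_bar       # credibility estimates per risk
    return Z, estimates

# Hypothetical example: 9 regions, 8 quarters of start-up counts
rng = np.random.default_rng(0)
lams = rng.uniform(2000, 25000, size=(9, 1))       # hypothetical regional levels
X = rng.poisson(lam=lams, size=(9, 8)).astype(float)
Z, estimates = ebct_model_1(X)
print(f"Z = {Z:.4f}")
print(estimates.round())
```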

Part ii)

Using the assumption of the EBCT model 1 again we define the following quantities:

$X_{ij}$ = the aggregate claims in the $j$th year from the $i$th risk

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$$

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} \bar{X}_i$$

Here are the parameter estimation formulas I have used in my calculations:

Quantity                    Estimator
$E[m(\theta)]$              $\bar{X}$
$E[s^2(\theta)]$            $\frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$
$\mathrm{Var}[m(\theta)]$   $\frac{1}{N-1}\sum_{i=1}^{N}(\bar{X}_i-\bar{X})^2-\frac{1}{Nn}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$

Credibility formula:

$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}}$$
Given that more data on the number of new company births/start-ups is available to us, we need to adjust the value of $n$: there are now 13 time units, so $n = 13$.

From performing the relevant intermediate calculations under the assumption of EBCT model 1 we can now determine the
credibility factor:
$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}} = \frac{13}{13 + \dfrac{5357877.39}{42931774.5}} = 0.990491$$

After determining that our credibility factor Z is approximately 0.99049, we can now perform estimates for the number of
new business start-ups we would expect in the regions for the following quarter i.e. Q2 for 2020.

The estimation formula we use is:

$$Z\bar{X}_i + (1-Z)\mu, \quad \text{where } \mu = 9723.46154$$

Here $Z$ is the credibility factor we have already defined, $\mu$ is our prior mean, i.e. $E[m(\theta)]$, and $\bar{X}_i$ is the mean of our sample data. We now use this formula to determine the expected number of new start-ups next quarter for all the regions.

Region of interest          Credibility factor Z    Projected number of company start-ups in Q2 2020
North East                  0.990491                2436
North West                  0.990491                10662
Yorkshire and the Humber    0.990491                5987
East Midlands               0.990491                6057
West Midlands               0.990491                8528
East                        0.990491                8936
London                      0.990491                24911
South East                  0.990491                13598
South West                  0.990491                6396

Part iii)

From comparing our answers to parts i and ii we can see that the credibility factor obtained in part ii is slightly higher, at 0.99049, than the credibility factor of 0.9853 obtained in part i. This aligns with the general behaviour we would expect from credibility factors: in the second case we have additional data on the number of new company start-ups (from 2017 and for the first quarter of 2020) available to us, so we would expect the credibility factor to be higher, as we place even more confidence and relevance in our risk data. The general pattern is that as n (the number of time units of data) increases, the credibility factor tends to 1. In both cases the predictions for the number of new start-ups are quite close to their corresponding sample means, because the credibility factor is very high and therefore less emphasis is placed on the prior mean E[m(θ)].

Nonetheless, a limitation of the EBCT Model 1 approach is that the model makes no adjustment for risk volume, so this very important piece of information is ignored. Furthermore, Model 1 requires more assumptions about the data than Model 2. This could, to an extent, raise some question marks over the accuracy of the predictions made under the EBCT Model 1 approach.

Part iv)

For this part see “file 5”

For the EBCT Model 2 approach we also consider the risk volume, which measures the amount of “business occurring” (alongside the aggregate claim amounts); in this example it is the population figure for each region in each quarter from Q1 2017 to Q2 2020. For EBCT Model 2 to be applicable we were also given the population figures for the different regions for Q2 2020, as we need to multiply our estimates (which will be numbers of new company start-ups per person) by the underlying population size of each region in Q2 2020. In terms of the risk volume calculations I have used the following formulas:

$$\bar{P}_i = \sum_{j=1}^{n} P_{ij}$$

$$\bar{P} = \sum_{i=1}^{N} \bar{P}_i$$

$$P^* = \frac{1}{Nn-1}\sum_{i=1}^{N} \bar{P}_i\left(1 - \frac{\bar{P}_i}{\bar{P}}\right)$$

I get the following values

$\bar{P} = 729624469.5$

$P^* = 5525564.286$

Then after using the following formulas:


$$\bar{X}_i = \sum_{j=1}^{n} \frac{P_{ij} X_{ij}}{\bar{P}_i}$$

$$\bar{X} = \sum_{i=1}^{N}\sum_{j=1}^{n} \frac{P_{ij} X_{ij}}{\bar{P}}$$
I get the following values for 𝑋̅𝑖 for the 9 different regions of interest and also get 𝑋̅ = 0.0141858

Region $i$    $\bar{X}_i$
1             0.0008242
2             0.001352352
3             0.001005
4             0.001149
5             0.001322
6             0.001312
7             0.002485
8             0.001355
9             0.001049

I will now use the following formulas to calculate the parameters of the EBCT model 2:
$$E[s^2(\theta)] = \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n} P_{ij}(X_{ij}-\bar{X}_i)^2\right\}$$

$$\mathrm{Var}[m(\theta)] = \frac{1}{P^*}\left(\frac{1}{Nn-1}\sum_{i=1}^{N}\sum_{j=1}^{n} P_{ij}(X_{ij}-\bar{X})^2 - \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n} P_{ij}(X_{ij}-\bar{X}_i)^2\right\}\right)$$

I get the following:

$E[s^2(\theta)] = 1.0260267$

$\mathrm{Var}[m(\theta)] = 0.0001645262$

Now we can determine the credibility estimate for all the regions by the formula:
$$Z_i = \frac{\sum_{j=1}^{n} P_{ij}}{\sum_{j=1}^{n} P_{ij} + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}}$$
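A minimal Python sketch of these EBCT Model 2 calculations follows. The arrays `Y` (quarterly start-up counts) and `P` (populations), both of shape (N regions, n quarters), are hypothetical stand-ins for the data used in this project, so the printed figures are illustrative only.

```python
import numpy as np

def ebct_model_2(Y, P):
    """EBCT Model 2 estimators: Y = counts, P = risk volumes, both of shape (N, n)."""
    N, n = Y.shape
    X = Y / P                                       # start-ups per person
    Pi_bar = P.sum(axis=1)                          # total population per region
    P_bar = Pi_bar.sum()
    P_star = (Pi_bar * (1 - Pi_bar / P_bar)).sum() / (N * n - 1)
    Xi_bar = (P * X).sum(axis=1) / Pi_bar           # population-weighted regional means
    X_bar = (P * X).sum() / P_bar                   # overall weighted mean
    # E[s^2(theta)]: average of the weighted within-region variances
    E_s2 = np.mean([(P[i] * (X[i] - Xi_bar[i]) ** 2).sum() / (n - 1) for i in range(N)])
    # Var[m(theta)] as defined above
    Var_m = ((P * (X - X_bar) ** 2).sum() / (N * n - 1) - E_s2) / P_star
    Z = Pi_bar / (Pi_bar + E_s2 / Var_m)            # per-region credibility factors
    rate = Z * Xi_bar + (1 - Z) * X_bar             # credibility estimate per person
    return Z, rate

# Hypothetical example: 9 regions, 13 quarters
rng = np.random.default_rng(1)
base_pop = rng.uniform(2e6, 9e6, size=(9, 1))
P = base_pop * rng.uniform(0.99, 1.01, size=(9, 13))          # populations by quarter
rates = rng.uniform(0.0008, 0.0025, size=(9, 1))              # start-ups per person
Y = rng.poisson(lam=rates * P).astype(float)
Z, rate = ebct_model_2(Y, P)
print(Z.round(6))
print((rate * P[:, -1]).round())    # scaled by a stand-in for the Q2 2020 populations
```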

The credibility factor for the different regions is set out below:
Region                      $\sum_{j=1}^{n} P_{ij}$    $E[s^2(\theta)]/\mathrm{Var}[m(\theta)]$    Credibility factor $Z_i$
North East                  37325197.23                6236.2510048                                0.999832949
North West                  102580517.9                6236.2510048                                0.99993921
Yorkshire and the Humber    76953777.6                 6236.2510048                                0.999918968
East Midlands               67622199.36                6236.2510048                                0.999907787
West Midlands               82988953.25                6236.2510048                                0.99992486
East                        87191322.61                6236.2510048                                0.999928481
London                      125250290.6                6236.2510048                                0.999950212
South East                  128356326.9                6236.2510048                                0.999951417
South West                  78681050.18                6236.2510048                                0.999920746

Therefore the estimate of company start-ups per person in each region for Q2 2020 under the EBCT Model 2 approach is

$$\bar{X}_i Z_i + \bar{X}(1 - Z_i)$$

where $\bar{X}$ is 0.0141858 and $Z_i$ is the credibility factor for region $i$. To obtain the estimated number of company start-ups in Q2 2020 we multiply the average number of start-ups per person by the Q2 2020 population of the respective region of interest. The resulting estimates are set out below.

Region                      $\bar{X}_i$      $Z_i$           Estimated company start-ups in Q2 2020
North East                  0.0008242        0.999832949     2234
North West                  0.001352352      0.99993921      10095
Yorkshire and the Humber    0.001005         0.999918968     5615
East Midlands               0.001149         0.999907787     5692
West Midlands               0.001322         0.99992486      8017
East                        0.001312         0.999928481     8330
London                      0.002485         0.999950212     22764
South East                  0.001355         0.999951417     12651
South West                  0.001049         0.999920746     6021


Part v)

EBCT model 2 context explanation:

𝑁- the number of risks i.e. the number of regions in the UK

𝑛- the number of time units (this was given to us on a quarterly basis)

𝑖- the risk i.e. the region

𝑗- the specific time quarters

𝑌𝑖𝑗 - this is the total number of company start-ups in that particular region for that specific time quarter.

𝑃𝑖𝑗 - this is the population size in region 𝑖 in quarter 𝑗 i.e. risk volume.

𝑋𝑖𝑗 - the total number of company start-ups per person in region 𝑖 in quarter 𝑗.

𝑋̅𝑖 is defined as the average number of company start-ups per person in a particular region 𝑖 across all the time units.

$\bar{P}_i$ is defined as the sum of all $P_{ij}$ terms across all the time units for a particular region $i$.

𝑃̅ is the sum of all 𝑃̅𝑖 terms.

𝐸[𝑚(𝜃)] = 𝑋̅ is the overall mean which is defined as the total number of business start-ups across all regions across all the
time units divided by 𝑃̅

$\alpha = P_{ij}(X_{ij} - \bar{X}_i)^2$ is the error for each region in each time unit, weighted by the population of that region in that specific time unit.

$\beta = P_{ij}(X_{ij} - \bar{X})^2$ is the error of each data point against the overall mean number of company start-ups, again weighted by the population of that region in that time unit.
$E[s^2(\theta)]$ is defined as $\dfrac{\sum \alpha}{N(n-1)}$.

$P^*$ is $\dfrac{1}{Nn-1}\sum \bar{P}_i\left(1 - \dfrac{\bar{P}_i}{\bar{P}}\right)$.

$\mathrm{Var}[m(\theta)]$ is defined as $\dfrac{1}{P^*}\left(\dfrac{\sum \beta}{Nn-1} - \dfrac{\sum \alpha}{N(n-1)}\right)$.

Part vi)

Similar to EBCT Model 1, the credibility factors obtained for all the regions under EBCT Model 2 were very high, owing to the large amount of available data. However, in my opinion the estimates of company start-ups in Q2 2020 under the EBCT Model 2 approach are more valid and more relevant, as they take into account the risk volume in each region, which can often vary over time (in this example, the population size of the regions of interest), whereas the EBCT Model 1 approach does not take this into account when predicting the number of new company start-ups. Hence the estimates under EBCT Model 2 are likely to be more representative and more accurate, as the quantities $X_{ij}$ are adjusted for the varying level of business in each time period. This is why EBCT Model 2 is more widely used, and more useful, than EBCT Model 1 in many practical real-life situations.

Part vii)

These estimates may be problematic in practice because under EBCT Model 2 we assume that, for each region $i$, the distribution of $X_{ij}$ depends on a parameter $\theta_i$ whose value is the same for every $j$ but is unknown.

Question 2)

Part i)

Pearson's correlation coefficient is a very powerful measure of the potential relationship between two variables. Below is the correlation analysis between the exam scores from year 1 and the final degree award/marks, followed by the Pearson correlation coefficient analysis for the year 2 scores and the final degree marks.

The Pearson correlation coefficient is positive in both cases, indicating a positive relationship; however, the correlation between the year 2 marks and the overall degree award is a lot higher, at 0.9316, than the year 1 coefficient of 0.726. This can possibly be explained by the fact that the marks in year 1 do not actually count towards the final degree award, whereas the marks for year 2 do contribute and carry a weighting towards the final mark. So a higher mark in year 2 contributes to a higher overall degree mark, which is why the Pearson correlation value is very high in this case.
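A short Python sketch of how these Pearson coefficients can be computed; the arrays below are hypothetical stand-ins for the year 1, year 2 and final marks, so the printed values only illustrate the calculation.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Hypothetical mark data for illustration only
rng = np.random.default_rng(2)
final = rng.uniform(40, 90, size=164)
year1 = 0.7 * final + rng.normal(0, 8, size=164)   # loosely related to the final mark
year2 = 0.9 * final + rng.normal(0, 3, size=164)   # strongly related to the final mark

print(pearson(year1, final), pearson(year2, final))
# np.corrcoef(year1, final)[0, 1] gives the same value as a cross-check
```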

Part ii)

Please see “file 1”

Spearman's correlation coefficient measures the strength of a monotonic relationship between two variables; the relationship does not need to be linear.

To determine the Spearman correlation coefficient we use the following general formula, given that there are ties/repeated scores in some of the students' exam marks.

$$r_s = \frac{\sum_i r(X_i)r(Y_i) - \frac{1}{n}\sum_i r(X_i)\sum_i r(Y_i)}{\sqrt{\left(\sum_i [r(X_i)]^2 - \frac{1}{n}\left[\sum_i r(X_i)\right]^2\right)\left(\sum_i [r(Y_i)]^2 - \frac{1}{n}\left[\sum_i r(Y_i)\right]^2\right)}}$$
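As a cross-check, here is a minimal Python sketch of this tie-corrected calculation, using average ranks for tied scores; the mark arrays are hypothetical and only illustrate the method.

```python
import numpy as np
from scipy.stats import rankdata   # assigns average ranks to tied values by default

def spearman_with_ties(x, y):
    """Spearman's coefficient as the Pearson correlation of the (average) ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    num = (rx * ry).sum() - rx.sum() * ry.sum() / n
    den = np.sqrt(((rx ** 2).sum() - rx.sum() ** 2 / n) * ((ry ** 2).sum() - ry.sum() ** 2 / n))
    return num / den

# Hypothetical marks containing ties, for illustration only
year1 = np.array([55.0, 62.0, 62.0, 70.0, 48.0, 55.0, 81.0, 66.0])
final = np.array([58.0, 64.0, 61.0, 72.0, 50.0, 59.0, 79.0, 64.0])
print(spearman_with_ties(year1, final))
# scipy.stats.spearmanr(year1, final) gives the same value as a cross-check
```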
The main findings from using this method (comparing the year 1 scores with the final degree scores) are set out in the table below.

$\sum_i r(X_i)r(Y_i) = 1376755$

$\frac{1}{n}\sum_i r(X_i)\sum_i r(Y_i) = 1113832.7$

$\sqrt{\left(\sum_i[r(X_i)]^2 - \frac{1}{n}[\sum_i r(X_i)]^2\right)\left(\sum_i[r(Y_i)]^2 - \frac{1}{n}[\sum_i r(Y_i)]^2\right)} = 368128.1753$

$r_s = \dfrac{1376755 - 1113832.7}{368128.1753} = 0.714214089$

Similarly, the main findings from using Spearman's general correlation coefficient formula for the year 2 scores and the final scores are shown in the table below (as before, the dependent variable, i.e. what is being studied, is the final mark, and the year 2 score is the regressor variable).

$\sum_i r(X_i)r(Y_i) = 1458132$

$\frac{1}{n}\sum_i r(X_i)\sum_i r(Y_i) = 1115730.03$

$\sqrt{\left(\sum_i[r(X_i)]^2 - \frac{1}{n}[\sum_i r(X_i)]^2\right)\left(\sum_i[r(Y_i)]^2 - \frac{1}{n}[\sum_i r(Y_i)]^2\right)} = 367598.9192$

$r_s = \dfrac{1458132 - 1115730.03}{367598.9192} = 0.931454588$

The values of Spearman's correlation coefficient for both pairs of data are slightly lower than the corresponding Pearson values. The correlation between the year 1 scores and the final scores gives a Pearson coefficient of approximately 0.7265, whereas the Spearman value is 0.714214; likewise, the correlation coefficients between the year 2 scores and the final scores are 0.93163 and 0.9314546 under the Pearson and Spearman methods, respectively. This can be explained by the fact that there are some repeated exam marks across the year 1, year 2 and final marks; consequently, under the Spearman approach there are ties in the rankings, which slightly reduces the correlation coefficient because we take the average of the tied ranks.

Part iii)

When performing inference on the underlying population correlation coefficient ρ we use the fact that under the Fisher Z
transformation:
$$\text{If } W = \frac{1}{2}\ln\frac{1+r}{1-r}, \text{ then } W \text{ is approximately } N\!\left(\frac{1}{2}\ln\frac{1+\rho}{1-\rho},\ \frac{1}{n-3}\right)$$

Where r is the correlation coefficient based on our sample, n is the number of datapoints available to us and ρ is our
population correlation coefficient under the null hypothesis. We are testing the null hypothesis that the population
correlation coefficient ρ is equal to 0.5 against the alternative hypothesis that ρ is more than 0.5. Under the null hypothesis
a correlation value of 0.5 suggests that the two variables are moderately correlated.
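A minimal Python sketch of this one-sided Fisher Z test; the sample correlations and n = 164 are the values quoted in this part, while the function itself is a generic sketch rather than the exact spreadsheet workings.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(r, n, rho0=0.5):
    """One-sided test of H0: rho = rho0 against H1: rho > rho0 via Fisher's Z transform."""
    w = np.arctanh(r)                  # 0.5 * ln((1 + r) / (1 - r))
    mu0 = np.arctanh(rho0)             # mean of W under H0 (0.5 * ln 3 when rho0 = 0.5)
    se = 1.0 / np.sqrt(n - 3)          # approximate standard deviation of W
    z = (w - mu0) / se
    return z, norm.sf(z)               # test statistic and one-sided p-value

for r in (0.726476, 0.931634):
    z, p = fisher_z_test(r, n=164)
    print(f"r = {r}: z = {z:.4f}, p = {p:.2e}")
```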

Test for year 1 scores and final scores:

𝑟 = 0.726476

$W = \tanh^{-1}(0.726476) = 0.921224$


Under $H_0$, $W$ approximately follows a $N\!\left(\frac{1}{2}\ln 3,\ \frac{1}{161}\right)$ distribution.

$P(W > 0.921224) = P(Z > 4.7191) \approx 0.000001$, so the p-value is less than the 5% significance level; thus we reject the null hypothesis and conclude that a population correlation coefficient of 0.5 is not sensible or appropriate.

Test for year 2 scores and final scores:

𝑟 = 0.931634

$W = \tanh^{-1}(0.931634) = 1.670623$


Under $H_0$, $W$ approximately follows a $N\!\left(\frac{1}{2}\ln 3,\ \frac{1}{161}\right)$ distribution.

$P(W > 1.670623) = P(Z > 14.2279)$, so the p-value is significantly less than 5% and we reject the null hypothesis. Thus a population correlation coefficient of 0.5 is unrealistic; the data suggest the population correlation coefficient is greater than 0.5.

Part iv)

Please see “file 2”

For the bivariate linear model we need to estimate the coefficients of linear regression by the method of least squares. We
need to estimate α and β (intercept and slope parameter) as well as the error variance 𝜎 2 . The bivariate regression line is
given by

$$\hat{y} = \hat{\alpha} + \hat{\beta}x$$

I have used the following formulas in my calculations to determine the relevant coefficients of the bivariate linear model:
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

$$\hat{\sigma}^2 = \frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)$$

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

Also, given that the bivariate linear regression model of interest is between the year 2 marks and the final degree marks, the regressor variable $X$ is the year 2 score and the dependent/response variable $Y$ is the final degree score.
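A short Python sketch of these least-squares calculations (including the coefficient of determination used later in this part); `x` and `y` are hypothetical stand-ins for the year 2 and final marks.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit of y = alpha + beta * x, with error variance and R^2."""
    n = len(x)
    Sxx = ((x - x.mean()) ** 2).sum()
    Syy = ((y - y.mean()) ** 2).sum()
    Sxy = ((x - x.mean()) * (y - y.mean())).sum()
    beta = Sxy / Sxx
    alpha = y.mean() - beta * x.mean()
    sigma2 = (Syy - Sxy ** 2 / Sxx) / (n - 2)      # estimated error variance
    r2 = (Sxy ** 2 / Sxx) / Syy                    # SS_REG / SS_TOT
    return alpha, beta, sigma2, r2

# Hypothetical marks for illustration only
rng = np.random.default_rng(3)
x = rng.uniform(40, 90, size=164)                  # year 2 marks
y = 5.9 + 0.91 * x + rng.normal(0, 3.9, size=164)  # final marks
print(simple_linear_regression(x, y))
```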

I get the following values:

$\sum x_i^2 = 743485.743$, $\bar{x} = 66.4448$, $n = 164$, so $S_{xx} = \sum x_i^2 - n\bar{x}^2 = 743485.743 - 164(66.4448^2) = 19440.26569$

$\sum y_i^2 = 740739.322$, $\bar{y} = 66.3586$, so $S_{yy} = \sum y_i^2 - n\bar{y}^2 = 740739.322 - 164(66.3586^2) = 18571.25979$

$\sum x_i y_i = 740808.052$, so $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 740808.052 - 164(66.4448)(66.3586) = 17701.89153$

$\hat{\beta} = \dfrac{S_{xy}}{S_{xx}} = \dfrac{17701.89153}{19440.26569} = 0.910578$

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 66.3586 - 0.910578 \times 66.4448 = 5.8554$

So the coefficients of the bivariate linear model are a slope parameter of 0.910578, an intercept of 5.8554 and an error variance of 15.138. Therefore our estimate of the bivariate linear regression line is $\hat{y} = 5.8554 + 0.910578x$.

To verify these coefficients of the bivariate linear model, I have produced a scatter plot showing the final marks against the year 2 marks below. The scatterplot clearly shows a strong positive correlation, and the trendline has the equation $y = 0.9106x + 5.854$, so our estimates for $\alpha$ and $\beta$ are an appropriate fit for the data.

Furthermore we can look at the variability explained by the model and the remaining variability that is unexplained by the
model. We use the following concept called the coefficient of determination which states:

$$R^2 = \frac{SS_{REG}}{SS_{TOT}}$$

Where 𝑆𝑆𝑇𝑂𝑇 is the total variation in responses and 𝑆𝑆𝑅𝐸𝐺 is the level of variability that is credited to the relationship
between the dependent and regressor variable i.e. the variability that is explained by the regression model. So the
coefficient of determination gives us a proportion of how much variability is explained by the bivariate linear regression
model.

Applying the relevant quantities I get


$$R^2 = \frac{16118.96505}{18571.25979} = 0.867952$$

This value is very high which indicates that the majority of variation is explained by the bivariate model and very little is left
over in residual variation. This supports that the bivariate linear regression model is a good fit for the data.

Part v)

To test the adequacy of the bivariate linear regression model obtained in the previous part we use the ANOVA (F) test, with the null hypothesis that the slope parameter $\beta$ is 0 (i.e. no linear relationship between the two variables) against the alternative hypothesis that the slope parameter is not 0. Unlike the coefficient of determination calculated in the previous part, the ANOVA test allows us to perform a more formal test of fit that takes into account the underlying distributional assumptions. In this test we use the facts that

$$\frac{SS_{RES}}{\sigma^2} \sim \chi^2_{n-2}, \qquad \frac{SS_{TOT}}{\sigma^2} \sim \chi^2_{n-1}$$

and, given that $SS_{RES}$ and $SS_{REG}$ are independent, $\dfrac{SS_{REG}}{\sigma^2} \sim \chi^2_{1}$.

By our bivariate linear regression model that we obtained in the previous part we had the following quantities:

𝑆𝑆𝑇𝑂𝑇 = 18571.25979 𝑆𝑆𝑅𝐸𝐺 = 16118.96505 𝑆𝑆𝑅𝐸𝑆 = 2452.29474

The ANOVA table is set out below (note $n = 164$):

Source of variation    Degrees of freedom    Sum of squares    Mean sum of squares
Regression             1                     16118.96505       16118.96505
Residual               162                   2452.29474        15.13762
Total                  163                   18571.25979

The test statistic we use is

$$\frac{MSS_{REG}}{MSS_{RES}} = \frac{16118.96505}{15.13762} = 1064.828$$

So we compare 1064.828 with an $F(1, 162)$ distribution. This gives a very small p-value, hence we reject $H_0: \beta = 0$.

From the ANOVA test we can therefore conclude that there is a linear relationship between the year 2 marks and the final marks, since we reject the null hypothesis that there is no such relationship.
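A short Python sketch of this ANOVA F test, using the sums of squares quoted above; `scipy.stats.f.sf` gives the upper-tail probability of the F distribution.

```python
from scipy.stats import f

ss_reg, ss_res = 16118.96505, 2452.29474
df_reg, df_res = 1, 162
f_stat = (ss_reg / df_reg) / (ss_res / df_res)   # approximately 1064.83
p_value = f.sf(f_stat, df_reg, df_res)           # P(F(1, 162) > f_stat)
print(f_stat, p_value)                           # p-value is effectively zero, so reject H0
```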

Part vi)

Please see “file 3”


We assume that the residuals of the bivariate linear regression model are independent and identically distributed with a
normal distribution of 𝑁(0, 𝜎 2 ).

The residuals (what’s left over) can be calculated by the formula:

$$\hat{e}_i = y_i - \hat{y}_i$$

Here $y_i$ is the observed value of the dependent variable and $\hat{y}_i$ is the value of the dependent variable predicted by our bivariate linear regression model. Recall that the fitted model was $\hat{y}_i = 5.8554 + 0.910578x_i$.

To determine whether the bivariate linear model is a good fit for the data, we need to check the residuals for normality. Here I have used two graphical checks: a normal probability plot (QQ plot) and a residual scatterplot.
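A minimal Python sketch of how these two diagnostic plots can be produced, assuming the fitted line from part iv; the arrays `x` and `y` are hypothetical stand-ins for the year 2 and final marks.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical marks for illustration only
rng = np.random.default_rng(4)
x = rng.uniform(40, 90, size=164)                            # year 2 marks
y = 5.8554 + 0.910578 * x + rng.normal(0, 3.9, size=164)     # final marks

residuals = y - (5.8554 + 0.910578 * x)                      # observed minus fitted values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(residuals, dist="norm", plot=ax1)             # QQ plot against the normal
ax1.set_title("Normal probability plot of residuals")
ax2.scatter(x, residuals)                                    # residuals vs explanatory variable
ax2.axhline(0, color="grey")
ax2.set_xlabel("Year 2 mark")
ax2.set_ylabel("Residual")
plt.show()
```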

[Figure: Normal probability plot of the residuals]

From checking the QQ plot, the residuals seem to follow a normal distribution reasonably well. If the residuals followed a normal distribution exactly, we would expect the points to lie on a straight diagonal line, and we can see that most of the data points are on or near the trendline, although the point furthest to the right deviates a little from the trendline, possibly indicating some deficiency in the bivariate linear model. Overall, however, I think the linear model is a good fit, as most of the points are close to the line.
In addition, we can plot the residuals against the explanatory variable in a scatterplot/residual plot, as shown below:

[Figure: Residual plot — residuals against the year 2 marks]

According to the scatterplot of the residuals against the explanatory variable (the year 2 scores), there does not seem to be any concrete pattern or relationship between the two: the data points are randomly scattered throughout, and there is nothing to indicate that this is not a random pattern. This supports the assumption that the residuals are independent of the explanatory variable, since no relationship between them can be deduced from the scatterplot, and the random scatter is consistent with the assumptions made about the residuals.

Overall, I would say that the bivariate linear regression model is a good fit for the data. There are a couple of discrepancies visible in the QQ plot, but I believe these are small and minor. Furthermore, the coefficient of determination calculated earlier indicated that approximately 87% of the overall variability is explained by the bivariate linear model, again suggesting a good fit. This, together with the QQ plot and the randomly scattered residual plot, leads me to conclude that the bivariate linear regression model is a sensible fit for the data.

Question 3:

Part i)
Below is a correlation coefficient matrix showing the correlation between the module marks scored by students and the final degree mark.

The numbers highlighted in yellow indicate very strong correlation, the ones in green indicate moderately strong positive correlation, the ones in blue indicate weakly moderate correlation, and the red ones indicate weak correlation.

Part ii)

To choose a suitable multiple linear regression model and determine the number of parameters to include, I have decided to use the forward selection approach. To start, I selected the covariate with the highest correlation coefficient with respect to the dependent variable, which is the final degree score. I then choose the next covariate based on which one increases the adjusted R-squared value by the largest amount, and continue in this way until the adjusted R-squared value is maximised. On this basis, the first covariate I chose is the module score for module Y2M7. This gives me the following:

By the forward selection approach we add the next covariate according to which one improves the adjusted coefficient of determination by the greatest amount. Hence the next covariate we add is Y2M6. We get the following:

As you can see, adding the Y2M6 marks improves the adjusted R-squared value. Based on this approach the next covariate we add is Y2M3. We get the following:

The next covariate we add is the Y2M2 scores:


Next we add the Y2M8 scores

Next we add the Y2M5 scores


Next we add the Y2M4 marks:

Next we add the Y2M1 marks:

Next we add the Y1M7 module scores for the students:


The next covariate we add is the scores of module Y1M8:

The final covariate we add is the scores for Y1M1:


Upon further analysis there are no more explanatory variables that can be added to this model without decreasing the adjusted coefficient of determination; hence the multiple linear model obtained here maximises the adjusted R-squared value at 0.87336 and is optimised. Therefore the multiple linear regression model obtained by the forward selection approach is:

$$y = 7.586188 + 0.047943x_1 - 0.07905x_2 + 0.02527x_3 + 0.082891x_4 + 0.0971865x_5 + 0.09694x_6 + 0.064159x_7 + 0.086145x_8 + 0.16545x_9 + 0.162355x_{10} + 0.129127x_{11}$$

where $x_1, x_2, \ldots, x_{11}$ are the covariates/explanatory variables, i.e. the scores for the various modules the students took, and 7.586188 is the intercept term.
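A minimal Python sketch of this forward selection procedure, choosing at each step the covariate that most improves the adjusted R-squared and stopping when no covariate improves it; the design matrix `modules` and response `final` are hypothetical stand-ins for the module and degree marks.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least-squares fit of y on X (with an intercept)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(X, y):
    """Greedily add the covariate that most improves adjusted R^2; stop when none does."""
    remaining, chosen, best = list(range(X.shape[1])), [], -np.inf
    while remaining:
        scores = {j: adjusted_r2(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break                                  # no remaining covariate improves the fit
        best, chosen = scores[j_best], chosen + [j_best]
        remaining.remove(j_best)
    return chosen, best

# Hypothetical data: 164 students, 16 module marks
rng = np.random.default_rng(5)
modules = rng.uniform(40, 90, size=(164, 16))
final = 7.6 + modules[:, :11] @ rng.uniform(0.02, 0.17, size=11) + rng.normal(0, 4, size=164)
print(forward_select(modules, final))
```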

Part iii)

To assess the adequacy of the multiple linear regression model I decided to use the ANOVA (F) test; the ANOVA analysis is set out below.

Comparing the F-test p-value with the 5% significance level I have chosen (as this is the most commonly used level), the p-value from this ANOVA test is significantly less than 5%. We therefore reject the null hypothesis, so the sampled data provide sufficient evidence to conclude that the multiple linear regression model fits the data better than the model with no independent variables.

Part iv)

Please see “file 4”.

To see whether the multiple linear regression model is a good fit for the data, we need to check the residuals and test them against the normality assumption. As with the bivariate linear model, the residuals are calculated as the difference between the observed values of the dependent variable and the values predicted by the multiple linear regression model.

Below is the normal probability plot for the residual scores


[Figure: Normal probability plot of the residuals from the multiple linear regression model]

The results from the normal probability plot are quite promising: the residuals seem to follow a normal distribution very well, as most of the data points are on or near the straight trendline. The point furthest to the left deviates from the trendline relative to the other points, so we may have a possible outlier. Overall, however, the linear model seems a very sensible fit for the data.

[Figure: Dot plot of the standardised residuals]
The dot plot of the standardised residuals shows a random pattern, with the residuals dispersed around the horizontal axis from approximately −3.5 to 3, suggesting that the multiple linear regression model is a good fit for the data and that the residuals follow a normal distribution.

[Figure: Histogram of the standardised residuals (frequency against residual bins from −3 to 3)]

Also, the histogram of the standardised residuals appears to follow a normal distribution very well; it is almost perfectly symmetric, similar to the standard normal distribution. There is no heavy skew, no obvious outlier and no heavy tail to suggest that the residuals do not follow a normal distribution. Hence, from performing this check we can assert that the residuals fit the normality assumption very well, and therefore the multiple regression model is a sensible fit for the data.

Part v)

Overall, both the bivariate and the multiple linear regression models seem to fit the data well, although I believe there are some potential deficiencies in both models, shown in their respective normal probability plots; these seem to be minor. The normal probability plot for the multiple linear regression model has more points close to the trendline than the plot for the bivariate model, potentially indicating greater normality in the residuals of the multiple linear regression model. Furthermore, the adjusted coefficient of determination for the multiple linear model is 0.87336, slightly greater than the adjusted value for the bivariate model of 0.867127. In addition, the histogram I plotted strongly supports the residuals of the multiple linear regression model following a normal distribution. Taking all of this into account leads me to conclude that the multiple linear regression model is a more suitable fit for the data than the simple bivariate linear regression model.

Question 4)

Generalised linear models (GLMs) are essentially an extension of the basic linear model with which you are already familiar. Like the simple linear regression model, a generalised linear model looks at the effect that the covariates/explanatory variables have on the behaviour of the dependent variable (the variable we are interested in). However, as you will recall, one of the main assumptions of the simple bivariate linear model and the multiple linear regression model was that the response variable Y follows a normal distribution, with $Y_i \sim N(\mu_i, \sigma^2)$. With the generalised linear model there is far more flexibility, in the sense that $Y_i$ can take a range of distributional forms from the exponential family, such as the binomial, normal, Poisson, gamma, exponential and chi-squared distributions.

Clearly, using a generalised linear model is more advantageous than the basic linear model in several real-life situations. For example, if an insurance company is interested in how the size of car insurance claims is affected by factors such as the age and gender of the policyholder, years of driving experience, make/model of vehicle and credit history, then assuming a normal distribution for the size of car insurance claims would not be appropriate and would likely be very problematic, since claim sizes can never be negative; a basic linear model under the normal assumption may not generate reliable or meaningful results.

More importantly, there are three main components you need to specify when constructing a generalised linear model:

• A particular distribution for the response/dependent variable
• A linear predictor, denoted by η
• A link function (for example, the logit function)

The linear predictor is a function of the covariates; in the simple bivariate case the linear predictor would be α + βx. The linear predictor effectively captures the relationship between the response variable and the covariates in the linear model. Moreover, when using GLMs we can extend the model to include interaction effects between covariates, known as interaction terms, as well as main effects acting in isolation. For example, if we were using a GLM to look at the level of deaths related to cardiovascular disease and included the covariates smoking, inactivity and weight, we would probably want an interaction term between inactivity and weight, as these covariates are likely to be strongly dependent on each other. A possible linear predictor for this model might look like this:

$$\eta = \alpha_i + \beta x_1 + \gamma x_2 + \delta x_1 x_2$$

where $\alpha_i$ is the categorical variable for smoking (with $i = 1$ if a person smokes and $i = 0$ if they do not), $x_1$ is the variable measuring the level of inactivity, $x_2$ is the variable measuring weight, $x_1 x_2$ is the interaction term combining weight and inactivity, and $\beta$, $\gamma$, $\delta$ are the respective coefficients of the covariates and the interaction term.

Finally, the link function links the linear predictor to the mean response μ, where the link function is denoted g(μ) and μ = E(Y). For the simple linear model the link is the identity, so the linear predictor and the mean response are equal, but in GLMs this need not be the case. For a link function to be valid it must be invertible (i.e. its inverse must exist) and it must be differentiable. Furthermore, each distribution has a natural link function, also known as the canonical link function. These are set out below.

Probability distribution    Canonical link function
Normal distribution         g(μ) = μ
Gamma distribution          g(μ) = 1/μ
Poisson distribution        g(μ) = log(μ)
Binomial distribution       g(μ) = log(μ/(1−μ))
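To make these components concrete, here is a small Python sketch of the canonical link functions above, together with a linear predictor that includes an interaction term being mapped back to a mean response through the inverse log link. All of the covariate values and coefficients are illustrative assumptions, not fitted values.

```python
import numpy as np

# Canonical link functions g(mu) from the table above
links = {
    "normal":   lambda mu: mu,                     # identity link
    "gamma":    lambda mu: 1.0 / mu,               # inverse link
    "poisson":  lambda mu: np.log(mu),             # log link
    "binomial": lambda mu: np.log(mu / (1 - mu)),  # logit link
}

# Hypothetical covariates: smoking indicator, inactivity score and weight (kg)
smokes, inactivity, weight = 1, 4.0, 82.0
alpha = {0: -6.0, 1: -5.5}                 # hypothetical intercepts by smoking status
beta, gamma, delta = 0.10, 0.01, 0.002     # hypothetical coefficients

# Linear predictor with main effects plus an inactivity-by-weight interaction term
eta = alpha[smokes] + beta * inactivity + gamma * weight + delta * inactivity * weight

# With a Poisson response and the log link, the mean response is the inverse link of eta
mu = np.exp(eta)
print(eta, mu, links["poisson"](mu))       # applying the link to mu recovers eta
```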

