
Ashley Pancholi

Statistical data analysis project MA2206

Question 1

Part i)

EBCT Model 1 is a model which incorporates Bayesian thinking and can be seen as a generalisation of the normal/normal model. Unlike in standard Bayesian credibility, our aim with EBCT Model 1 is to estimate not θ itself but some function of θ, namely m(θ). When using EBCT Model 1 we need to define the following notation:

$X_{ij}$ = the aggregate claims in the $j$th year from the $i$th risk

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$$

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} \bar{X}_i$$

Here are the parameter estimators I have used in my calculations:

Quantity                    Estimator
$E[m(\theta)]$              $\bar{X}$
$E[s^2(\theta)]$            $\frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$
$\mathrm{Var}[m(\theta)]$   $\frac{1}{N-1}\sum_{i=1}^{N}(\bar{X}_i-\bar{X})^2-\frac{1}{Nn}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$

Credibility formula:

$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}}$$

When using EBCT Model 1 to model the number of new businesses/companies quarter by quarter for the various regions, $X_{ij}$ represents the total number of new companies in region $i$ in quarter $j$. Here $j$ indexes the time units we are working in, which are quarters, and $i$ denotes the different regions. In this example $n$ denotes the number of time units of available data and $N$ denotes the number of risks (regions); hence $n = 8$ and $N = 9$.

From performing the relevant intermediate calculations under the assumption of EBCT model 1 we can now determine our
credibility factor Z given by
$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}} = \frac{8}{8 + \dfrac{5225846.28}{43688833.1}} = 0.985268$$

After determining that our credibility factor Z is approximately 0.9853, we can now perform estimates for the number of
new business start-ups we would expect in the regions for the following quarter i.e. Q1 for 2020.

The estimation formula we use is:

$$Z\bar{X}_i + (1-Z)\mu$$

Here $Z$ is the credibility factor we have already defined, $\mu$ is our prior mean, i.e. $E[m(\theta)]$, and $\bar{X}_i$ is the mean of our sample data. We now use this formula to determine the expected number of new company start-ups we would expect under EBCT Model 1 next quarter for all the regions.

Region of interest          Credibility factor Z    Projected number of company start-ups in Q1 2020
North East                  0.9853                  2459
North West                  0.9853                  10386
Yorkshire and the Humber    0.9853                  6070
East Midlands               0.9853                  6036
West Midlands               0.9853                  8590
East                        0.9853                  8983
London                      0.9853                  25015
South East                  0.9853                  13840
South West                  0.9853                  6312

From performing the estimate calculations we can see that the projected figures for Q1 2020 are reasonably close to the corresponding sample mean for each region. This is because the credibility factor we calculated was very high, at 0.9853: we essentially place more confidence and relevance in the risk data, so the estimates we have calculated are weighted averages tilted heavily towards the risk data, with much less relevance placed on the collateral data.
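As a cross-check on the spreadsheet workings, here is a minimal Python sketch of the EBCT Model 1 calculations described above. The array `X` of quarterly start-up counts (regions as rows, quarters as columns) is a hypothetical stand-in for the data used in this project, so the printed numbers are illustrative only.

```python
import numpy as np

def ebct_model_1(X):
    """EBCT Model 1 estimators for a data matrix X of shape (N risks, n time units)."""
    N, n = X.shape
    Xi_bar = X.mean(axis=1)                        # per-risk means
    X_bar = Xi_bar.mean()                          # overall mean, estimates E[m(theta)]
    # E[s^2(theta)]: average of the within-risk sample variances
    E_s2 = np.mean([((X[i] - Xi_bar[i]) ** 2).sum() / (n - 1) for i in range(N)])
    # Var[m(theta)]: between-risk variance minus the correction term E[s^2(theta)] / n
    Var_m = ((Xi_bar - X_bar) ** 2).sum() / (N - 1) - E_s2 / n
    Z = n / (n + E_s2 / Var_m)                     # credibility factor
    estimates = Z * Xi_bar + (1 - Z) * X_bar       # credibility estimates per risk
    return Z, estimates

# Hypothetical example: 9 regions, 8 quarters of start-up counts
rng = np.random.default_rng(0)
lams = rng.uniform(2000, 25000, size=(9, 1))       # hypothetical regional levels
X = rng.poisson(lam=lams, size=(9, 8)).astype(float)
Z, estimates = ebct_model_1(X)
print(f"Z = {Z:.4f}")
print(estimates.round())
```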

Part ii)

Using the assumption of the EBCT model 1 again we define the following quantities:

$X_{ij}$ = the aggregate claims in the $j$th year from the $i$th risk

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$$

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} \bar{X}_i$$

Here are the parameter estimation formulas I have used in my calculations:

Quantity                    Estimator
$E[m(\theta)]$              $\bar{X}$
$E[s^2(\theta)]$            $\frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$
$\mathrm{Var}[m(\theta)]$   $\frac{1}{N-1}\sum_{i=1}^{N}(\bar{X}_i-\bar{X})^2-\frac{1}{Nn}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n}(X_{ij}-\bar{X}_i)^2\right\}$

Credibility formula:

$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}}$$
Given that more data on the number of new company births/start-ups is available to us, we need to adjust the value of $n$: there are now 13 time units, so $n = 13$.

From performing the relevant intermediate calculations under the assumption of EBCT model 1 we can now determine the
credibility factor:
$$Z = \frac{n}{n + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}} = \frac{13}{13 + \dfrac{5357877.39}{42931774.5}} = 0.990491$$

After determining that our credibility factor Z is approximately 0.99049, we can now perform estimates for the number of
new business start-ups we would expect in the regions for the following quarter i.e. Q2 for 2020.

The estimation formula we use is:

$$Z\bar{X}_i + (1-Z)\mu, \quad \text{where } \mu = 9723.46154$$

Here $Z$ is the credibility factor we have already defined, $\mu$ is our prior mean, i.e. $E[m(\theta)]$, and $\bar{X}_i$ is the mean of our sample data. We now use this formula to determine the expected number of new start-ups next quarter for all the regions.

Region of interest          Credibility factor Z    Projected number of company start-ups in Q2 2020
North East                  0.990491                2436
North West                  0.990491                10662
Yorkshire and the Humber    0.990491                5987
East Midlands               0.990491                6057
West Midlands               0.990491                8528
East                        0.990491                8936
London                      0.990491                24911
South East                  0.990491                13598
South West                  0.990491                6396

Part iii)

From comparing our answers to parts i and ii we can see that the credibility factor obtained in part ii is slightly higher, at 0.99049, than the credibility factor of 0.9853 obtained in part i. This aligns with the general behaviour we would expect from credibility factors: in the second case we have additional data on the number of new company start-ups (from 2017 and for the first quarter of 2020) available to us, so we would expect the credibility factor to be higher, as we place even more confidence and relevance in our risk data. The general pattern is that as n (the number of time units of data) increases, the credibility factor tends to 1. In both cases the predictions for the number of new start-ups are quite close to their corresponding sample means, because the credibility factor is very high and therefore less emphasis is placed on the prior mean E[m(θ)].

Nonetheless, a limitation of the EBCT Model 1 approach is that the model makes no adjustment for risk volume, so this very important piece of information is ignored. Furthermore, Model 1 requires more assumptions about the data than Model 2. This could, to an extent, raise some question marks over the accuracy of the predictions made under the EBCT Model 1 approach.

Part iv)

For this part see “file 5”

For the EBCT Model 2 approach we also consider the risk volume, which measures the amount of “business occurring” (alongside the aggregate claim amounts); in this example it is the population figure for each region in each quarter from Q1 2017 to Q2 2020. For EBCT Model 2 to be applicable we were also given the population figures for the different regions for Q2 2020, as we need to multiply our estimates (which will be numbers of new company start-ups per person) by the underlying population size of each region in Q2 2020. In terms of the risk volume calculations I have used the following formulas:

$$\bar{P}_i = \sum_{j=1}^{n} P_{ij}$$

$$\bar{P} = \sum_{i=1}^{N} \bar{P}_i$$

$$P^* = \frac{1}{Nn-1}\sum_{i=1}^{N} \bar{P}_i\left(1 - \frac{\bar{P}_i}{\bar{P}}\right)$$

I get the following values

$\bar{P} = 729624469.5$

$P^* = 5525564.286$

Then after using the following formulas:


$$\bar{X}_i = \sum_{j=1}^{n} \frac{P_{ij} X_{ij}}{\bar{P}_i}$$

$$\bar{X} = \sum_{i=1}^{N}\sum_{j=1}^{n} \frac{P_{ij} X_{ij}}{\bar{P}}$$
I get the following values for 𝑋̅𝑖 for the 9 different regions of interest and also get 𝑋̅ = 0.0141858

Region $i$    $\bar{X}_i$
1             0.0008242
2             0.001352352
3             0.001005
4             0.001149
5             0.001322
6             0.001312
7             0.002485
8             0.001355
9             0.001049

I will now use the following formulas to calculate the parameters of the EBCT model 2:
$$E[s^2(\theta)] = \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n} P_{ij}(X_{ij}-\bar{X}_i)^2\right\}$$

$$\mathrm{Var}[m(\theta)] = \frac{1}{P^*}\left(\frac{1}{Nn-1}\sum_{i=1}^{N}\sum_{j=1}^{n} P_{ij}(X_{ij}-\bar{X})^2 - \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{n-1}\sum_{j=1}^{n} P_{ij}(X_{ij}-\bar{X}_i)^2\right\}\right)$$

I get the following:

$E[s^2(\theta)] = 1.0260267$

$\mathrm{Var}[m(\theta)] = 0.0001645262$

Now we can determine the credibility estimate for all the regions by the formula:
$$Z_i = \frac{\sum_{j=1}^{n} P_{ij}}{\sum_{j=1}^{n} P_{ij} + \dfrac{E[s^2(\theta)]}{\mathrm{Var}[m(\theta)]}}$$
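A minimal Python sketch of these EBCT Model 2 calculations follows. The arrays `Y` (quarterly start-up counts) and `P` (populations), both of shape (N regions, n quarters), are hypothetical stand-ins for the data used in this project, so the printed figures are illustrative only.

```python
import numpy as np

def ebct_model_2(Y, P):
    """EBCT Model 2 estimators: Y = counts, P = risk volumes, both of shape (N, n)."""
    N, n = Y.shape
    X = Y / P                                       # start-ups per person
    Pi_bar = P.sum(axis=1)                          # total population per region
    P_bar = Pi_bar.sum()
    P_star = (Pi_bar * (1 - Pi_bar / P_bar)).sum() / (N * n - 1)
    Xi_bar = (P * X).sum(axis=1) / Pi_bar           # population-weighted regional means
    X_bar = (P * X).sum() / P_bar                   # overall weighted mean
    # E[s^2(theta)]: average of the weighted within-region variances
    E_s2 = np.mean([(P[i] * (X[i] - Xi_bar[i]) ** 2).sum() / (n - 1) for i in range(N)])
    # Var[m(theta)] as defined above
    Var_m = ((P * (X - X_bar) ** 2).sum() / (N * n - 1) - E_s2) / P_star
    Z = Pi_bar / (Pi_bar + E_s2 / Var_m)            # per-region credibility factors
    rate = Z * Xi_bar + (1 - Z) * X_bar             # credibility estimate per person
    return Z, rate

# Hypothetical example: 9 regions, 13 quarters
rng = np.random.default_rng(1)
base_pop = rng.uniform(2e6, 9e6, size=(9, 1))
P = base_pop * rng.uniform(0.99, 1.01, size=(9, 13))          # populations by quarter
rates = rng.uniform(0.0008, 0.0025, size=(9, 1))              # start-ups per person
Y = rng.poisson(lam=rates * P).astype(float)
Z, rate = ebct_model_2(Y, P)
print(Z.round(6))
print((rate * P[:, -1]).round())    # scaled by a stand-in for the Q2 2020 populations
```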

The credibility factor for the different regions is set out below:
Region                      $\sum_{j=1}^{n} P_{ij}$    $E[s^2(\theta)]/\mathrm{Var}[m(\theta)]$    Credibility factor $Z_i$
North East                  37325197.23                6236.2510048                                0.999832949
North West                  102580517.9                6236.2510048                                0.99993921
Yorkshire and the Humber    76953777.6                 6236.2510048                                0.999918968
East Midlands               67622199.36                6236.2510048                                0.999907787
West Midlands               82988953.25                6236.2510048                                0.99992486
East                        87191322.61                6236.2510048                                0.999928481
London                      125250290.6                6236.2510048                                0.999950212
South East                  128356326.9                6236.2510048                                0.999951417
South West                  78681050.18                6236.2510048                                0.999920746

Therefore the estimate of company start-ups per person in each region for Q2 2020 under the EBCT Model 2 approach is

$$\bar{X}_i Z_i + \bar{X}(1 - Z_i)$$

where $\bar{X}$ is 0.0141858 and $Z_i$ is the credibility factor for region $i$. To obtain the estimated number of company start-ups in Q2 2020 we multiply the average number of start-ups per person by the Q2 2020 population of the respective region of interest. The resulting estimates are set out below.

Region                      $\bar{X}_i$      $Z_i$           Estimated company start-ups in Q2 2020
North East                  0.0008242        0.999832949     2234
North West                  0.001352352      0.99993921      10095
Yorkshire and the Humber    0.001005         0.999918968     5615
East Midlands               0.001149         0.999907787     5692
West Midlands               0.001322         0.99992486      8017
East                        0.001312         0.999928481     8330
London                      0.002485         0.999950212     22764
South East                  0.001355         0.999951417     12651
South West                  0.001049         0.999920746     6021


Part v)

EBCT model 2 context explanation:

𝑁- the number of risks i.e. the number of regions in the UK

𝑛- the number of time units (this was given to us on a quarterly basis)

𝑖- the risk i.e. the region

𝑗- the specific time quarters

𝑌𝑖𝑗 - this is the total number of company start-ups in that particular region for that specific time quarter.

𝑃𝑖𝑗 - this is the population size in region 𝑖 in quarter 𝑗 i.e. risk volume.

𝑋𝑖𝑗 - the total number of company start-ups per person in region 𝑖 in quarter 𝑗.

𝑋̅𝑖 is defined as the average number of company start-ups per person in a particular region 𝑖 across all the time units.

$\bar{P}_i$ is defined as the sum of all $P_{ij}$ terms across all the time units for a particular region $i$.

𝑃̅ is the sum of all 𝑃̅𝑖 terms.

𝐸[𝑚(𝜃)] = 𝑋̅ is the overall mean which is defined as the total number of business start-ups across all regions across all the
time units divided by 𝑃̅

$\alpha = P_{ij}(X_{ij} - \bar{X}_i)^2$ is the error for each region in each time unit, weighted by the population of that region in that specific time unit.

$\beta = P_{ij}(X_{ij} - \bar{X})^2$ is the error of each data point against the overall mean number of company start-ups, again weighted by the population of that region in that time unit.
$E[s^2(\theta)]$ is defined as $\dfrac{\sum \alpha}{N(n-1)}$.

$P^*$ is $\dfrac{1}{Nn-1}\sum \bar{P}_i\left(1 - \dfrac{\bar{P}_i}{\bar{P}}\right)$.

$\mathrm{Var}[m(\theta)]$ is defined as $\dfrac{1}{P^*}\left(\dfrac{\sum \beta}{Nn-1} - \dfrac{\sum \alpha}{N(n-1)}\right)$.

Part vi)

Similar to EBCT Model 1, the credibility factors obtained for all the regions under EBCT Model 2 were very high, owing to the large amount of available data. However, in my opinion the estimates of company start-ups in Q2 2020 under the EBCT Model 2 approach are more valid and more relevant, as they take into account the risk volume in each region, which can often vary over time (in this example, the population size of the regions of interest), whereas the EBCT Model 1 approach does not take this into account when predicting the number of new company start-ups. Hence the estimates under EBCT Model 2 are likely to be more representative and more accurate, as the quantities $X_{ij}$ are adjusted for the varying level of business in each time period. This is why EBCT Model 2 is more widely used, and more useful, than EBCT Model 1 in many practical real-life situations.

Part vii)

These estimates may be problematic in practice because under EBCT Model 2 we assume that, for each region $i$, the distribution of $X_{ij}$ depends on a parameter $\theta_i$ whose value is the same for every $j$ but is unknown.

Question 2)

Part i)

Pearson's correlation coefficient is a very powerful measure of the potential relationship between two variables. Below is the correlation analysis between the exam scores from year 1 and the final degree award/marks, followed by the Pearson correlation coefficient analysis for the year 2 scores and the final degree marks.

The Pearson correlation coefficient is positive in both cases, indicating a positive relationship; however, the correlation between the year 2 marks and the overall degree award is a lot higher, at 0.9316, than the year 1 coefficient of 0.726. This can possibly be explained by the fact that the marks in year 1 do not actually count towards the final degree award, whereas the marks for year 2 do contribute and carry a weighting towards the final mark. So a higher mark in year 2 contributes to a higher overall degree mark, which is why the Pearson correlation value is very high in this case.
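A short Python sketch of how these Pearson coefficients can be computed; the arrays below are hypothetical stand-ins for the year 1, year 2 and final marks, so the printed values only illustrate the calculation.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Hypothetical mark data for illustration only
rng = np.random.default_rng(2)
final = rng.uniform(40, 90, size=164)
year1 = 0.7 * final + rng.normal(0, 8, size=164)   # loosely related to the final mark
year2 = 0.9 * final + rng.normal(0, 3, size=164)   # strongly related to the final mark

print(pearson(year1, final), pearson(year2, final))
# np.corrcoef(year1, final)[0, 1] gives the same value as a cross-check
```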

Part ii)

Please see “file 1”

Spearman's correlation coefficient measures the strength of a monotonic relationship between two variables; the relationship does not need to be linear.

To determine the Spearman correlation coefficient we use the following general formula, given that there are ties/repeated scores in some of the students' exam marks.

$$r_s = \frac{\sum_i r(X_i)r(Y_i) - \frac{1}{n}\sum_i r(X_i)\sum_i r(Y_i)}{\sqrt{\left(\sum_i [r(X_i)]^2 - \frac{1}{n}\left[\sum_i r(X_i)\right]^2\right)\left(\sum_i [r(Y_i)]^2 - \frac{1}{n}\left[\sum_i r(Y_i)\right]^2\right)}}$$
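As a cross-check, here is a minimal Python sketch of this tie-corrected calculation, using average ranks for tied scores; the mark arrays are hypothetical and only illustrate the method.

```python
import numpy as np
from scipy.stats import rankdata   # assigns average ranks to tied values by default

def spearman_with_ties(x, y):
    """Spearman's coefficient as the Pearson correlation of the (average) ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    num = (rx * ry).sum() - rx.sum() * ry.sum() / n
    den = np.sqrt(((rx ** 2).sum() - rx.sum() ** 2 / n) * ((ry ** 2).sum() - ry.sum() ** 2 / n))
    return num / den

# Hypothetical marks containing ties, for illustration only
year1 = np.array([55.0, 62.0, 62.0, 70.0, 48.0, 55.0, 81.0, 66.0])
final = np.array([58.0, 64.0, 61.0, 72.0, 50.0, 59.0, 79.0, 64.0])
print(spearman_with_ties(year1, final))
# scipy.stats.spearmanr(year1, final) gives the same value as a cross-check
```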
The main findings from using this method (comparing the year 1 scores with the final degree scores) are set out in the table below.

$\sum_i r(X_i)r(Y_i) = 1376755$

$\frac{1}{n}\sum_i r(X_i)\sum_i r(Y_i) = 1113832.7$

$\sqrt{\left(\sum_i[r(X_i)]^2 - \frac{1}{n}[\sum_i r(X_i)]^2\right)\left(\sum_i[r(Y_i)]^2 - \frac{1}{n}[\sum_i r(Y_i)]^2\right)} = 368128.1753$

$r_s = \dfrac{1376755 - 1113832.7}{368128.1753} = 0.714214089$

Similarly, the main findings from using Spearman's general correlation coefficient formula for the year 2 scores and the final scores are shown in the table below (as before, the dependent variable, i.e. what is being studied, is the final mark, and the year 2 score is the regressor variable).

$\sum_i r(X_i)r(Y_i) = 1458132$

$\frac{1}{n}\sum_i r(X_i)\sum_i r(Y_i) = 1115730.03$

$\sqrt{\left(\sum_i[r(X_i)]^2 - \frac{1}{n}[\sum_i r(X_i)]^2\right)\left(\sum_i[r(Y_i)]^2 - \frac{1}{n}[\sum_i r(Y_i)]^2\right)} = 367598.9192$

$r_s = \dfrac{1458132 - 1115730.03}{367598.9192} = 0.931454588$

The values of Spearman's correlation coefficient for both pairs of data are slightly lower than the corresponding Pearson values. The correlation between the year 1 scores and the final scores gives a Pearson coefficient of approximately 0.7265, whereas the Spearman value is 0.714214; likewise, the correlation coefficients between the year 2 scores and the final scores are 0.93163 and 0.9314546 under the Pearson and Spearman methods, respectively. This can be explained by the fact that there are some repeated exam marks across the year 1, year 2 and final marks; consequently, under the Spearman approach there are ties in the rankings, which slightly reduces the correlation coefficient because we take the average of the tied ranks.

Part iii)

When performing inference on the underlying population correlation coefficient ρ we use the fact that under the Fisher Z
transformation:
$$\text{If } W = \frac{1}{2}\ln\frac{1+r}{1-r}, \text{ then } W \text{ is approximately } N\!\left(\frac{1}{2}\ln\frac{1+\rho}{1-\rho},\ \frac{1}{n-3}\right)$$

Where r is the correlation coefficient based on our sample, n is the number of datapoints available to us and ρ is our
population correlation coefficient under the null hypothesis. We are testing the null hypothesis that the population
correlation coefficient ρ is equal to 0.5 against the alternative hypothesis that ρ is more than 0.5. Under the null hypothesis
a correlation value of 0.5 suggests that the two variables are moderately correlated.
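A minimal Python sketch of this one-sided Fisher Z test; the sample correlations and n = 164 are the values quoted in this part, while the function itself is a generic sketch rather than the exact spreadsheet workings.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(r, n, rho0=0.5):
    """One-sided test of H0: rho = rho0 against H1: rho > rho0 via Fisher's Z transform."""
    w = np.arctanh(r)                  # 0.5 * ln((1 + r) / (1 - r))
    mu0 = np.arctanh(rho0)             # mean of W under H0 (0.5 * ln 3 when rho0 = 0.5)
    se = 1.0 / np.sqrt(n - 3)          # approximate standard deviation of W
    z = (w - mu0) / se
    return z, norm.sf(z)               # test statistic and one-sided p-value

for r in (0.726476, 0.931634):
    z, p = fisher_z_test(r, n=164)
    print(f"r = {r}: z = {z:.4f}, p = {p:.2e}")
```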

Test for year 1 scores and final scores:

𝑟 = 0.726476

$W = \tanh^{-1}(0.726476) = 0.921224$


Under $H_0$, $W$ approximately follows a $N\!\left(\frac{1}{2}\ln 3,\ \frac{1}{161}\right)$ distribution.

$P(W > 0.921224) = P(Z > 4.7191) \approx 0.000001$, so the p-value is less than the 5% significance level; thus we reject the null hypothesis and conclude that a population correlation coefficient of 0.5 is not sensible or appropriate.

Test for year 2 scores and final scores:

𝑟 = 0.931634

$W = \tanh^{-1}(0.931634) = 1.670623$


Under $H_0$, $W$ approximately follows a $N\!\left(\frac{1}{2}\ln 3,\ \frac{1}{161}\right)$ distribution.

$P(W > 1.670623) = P(Z > 14.2279)$, so the p-value is significantly less than 5% and we reject the null hypothesis. Thus a population correlation coefficient of 0.5 is unrealistic; the data suggest the population correlation coefficient is greater than 0.5.

Part iv)

Please see “file 2”

For the bivariate linear model we need to estimate the coefficients of linear regression by the method of least squares. We
need to estimate α and β (intercept and slope parameter) as well as the error variance 𝜎 2 . The bivariate regression line is
given by

$$\hat{y} = \hat{\alpha} + \hat{\beta}x$$

I have used the following formulas in my calculations to determine the relevant coefficients of the bivariate linear model:
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

$$\hat{\sigma}^2 = \frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)$$

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

Also, given that the bivariate linear regression model of interest is between the year 2 marks and the final degree marks, the regressor variable $X$ is the year 2 score and the dependent/response variable $Y$ is the final degree score.
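A short Python sketch of these least-squares calculations (including the coefficient of determination used later in this part); `x` and `y` are hypothetical stand-ins for the year 2 and final marks.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit of y = alpha + beta * x, with error variance and R^2."""
    n = len(x)
    Sxx = ((x - x.mean()) ** 2).sum()
    Syy = ((y - y.mean()) ** 2).sum()
    Sxy = ((x - x.mean()) * (y - y.mean())).sum()
    beta = Sxy / Sxx
    alpha = y.mean() - beta * x.mean()
    sigma2 = (Syy - Sxy ** 2 / Sxx) / (n - 2)      # estimated error variance
    r2 = (Sxy ** 2 / Sxx) / Syy                    # SS_REG / SS_TOT
    return alpha, beta, sigma2, r2

# Hypothetical marks for illustration only
rng = np.random.default_rng(3)
x = rng.uniform(40, 90, size=164)                  # year 2 marks
y = 5.9 + 0.91 * x + rng.normal(0, 3.9, size=164)  # final marks
print(simple_linear_regression(x, y))
```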

I get the following values:

$\sum x_i^2 = 743485.743$, $\bar{x} = 66.4448$, $n = 164$, so $S_{xx} = \sum x_i^2 - n\bar{x}^2 = 743485.743 - 164(66.4448^2) = 19440.26569$

$\sum y_i^2 = 740739.322$, $\bar{y} = 66.3586$, so $S_{yy} = \sum y_i^2 - n\bar{y}^2 = 740739.322 - 164(66.3586^2) = 18571.25979$

$\sum x_i y_i = 740808.052$, so $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 740808.052 - 164(66.4448)(66.3586) = 17701.89153$

$\hat{\beta} = \dfrac{S_{xy}}{S_{xx}} = \dfrac{17701.89153}{19440.26569} = 0.910578$

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 66.3586 - 0.910578 \times 66.4448 = 5.8554$

So the coefficients of the bivariate linear model are a slope parameter of 0.910578, an intercept of 5.8554 and an error variance of 15.138. Therefore our estimate of the bivariate linear regression line is $\hat{y} = 5.8554 + 0.910578x$.

To verify these coefficients of the bivariate linear model, I have produced a scatter plot showing the final marks against the year 2 marks below. The scatterplot clearly shows a strong positive correlation, and the trendline has the equation $y = 0.9106x + 5.854$, so our estimates for $\alpha$ and $\beta$ are an appropriate fit for the data.

Furthermore we can look at the variability explained by the model and the remaining variability that is unexplained by the
model. We use the following concept called the coefficient of determination which states:

$$R^2 = \frac{SS_{REG}}{SS_{TOT}}$$

Where 𝑆𝑆𝑇𝑂𝑇 is the total variation in responses and 𝑆𝑆𝑅𝐸𝐺 is the level of variability that is credited to the relationship
between the dependent and regressor variable i.e. the variability that is explained by the regression model. So the
coefficient of determination gives us a proportion of how much variability is explained by the bivariate linear regression
model.

Applying the relevant quantities I get


$$R^2 = \frac{16118.96505}{18571.25979} = 0.867952$$

This value is very high which indicates that the majority of variation is explained by the bivariate model and very little is left
over in residual variation. This supports that the bivariate linear regression model is a good fit for the data.

Part v)

To test the adequacy of the bivariate linear regression model obtained in the previous part we use the ANOVA (F) test, with the null hypothesis that the slope parameter $\beta$ is 0 (i.e. no linear relationship between the two variables) against the alternative hypothesis that the slope parameter is not 0. Unlike the coefficient of determination calculated in the previous part, the ANOVA test allows us to perform a more formal test of fit that takes into account the underlying distributional assumptions. In this test we use the facts that

$$\frac{SS_{RES}}{\sigma^2} \sim \chi^2_{n-2}, \qquad \frac{SS_{TOT}}{\sigma^2} \sim \chi^2_{n-1}$$

and, given that $SS_{RES}$ and $SS_{REG}$ are independent, $\dfrac{SS_{REG}}{\sigma^2} \sim \chi^2_{1}$.

By our bivariate linear regression model that we obtained in the previous part we had the following quantities:

𝑆𝑆𝑇𝑂𝑇 = 18571.25979 𝑆𝑆𝑅𝐸𝐺 = 16118.96505 𝑆𝑆𝑅𝐸𝑆 = 2452.29474

The ANOVA table is set out below (note $n = 164$):

Source of variation    Degrees of freedom    Sum of squares    Mean sum of squares
Regression             1                     16118.96505       16118.96505
Residual               162                   2452.29474        15.13762
Total                  163                   18571.25979

The test statistic we use is

$$\frac{MSS_{REG}}{MSS_{RES}} = \frac{16118.96505}{15.13762} = 1064.828$$

So we compare 1064.828 with an $F(1, 162)$ distribution. This gives a very small p-value, hence we reject $H_0: \beta = 0$.

From the ANOVA test we can therefore conclude that there is a linear relationship between the year 2 marks and the final marks, since we reject the null hypothesis that there is no such relationship.
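A short Python sketch of this ANOVA F test, using the sums of squares quoted above; `scipy.stats.f.sf` gives the upper-tail probability of the F distribution.

```python
from scipy.stats import f

ss_reg, ss_res = 16118.96505, 2452.29474
df_reg, df_res = 1, 162
f_stat = (ss_reg / df_reg) / (ss_res / df_res)   # approximately 1064.83
p_value = f.sf(f_stat, df_reg, df_res)           # P(F(1, 162) > f_stat)
print(f_stat, p_value)                           # p-value is effectively zero, so reject H0
```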

Part vi)

Please see “file 3”


We assume that the residuals of the bivariate linear regression model are independent and identically distributed with a
normal distribution of 𝑁(0, 𝜎 2 ).

The residuals (what’s left over) can be calculated by the formula:

$$\hat{e}_i = y_i - \hat{y}_i$$

Here $y_i$ is the observed value of the dependent variable and $\hat{y}_i$ is the value of the dependent variable predicted by our bivariate linear regression model. Recall that the fitted model was $\hat{y}_i = 5.8554 + 0.910578x_i$.

To determine whether the bivariate linear model is a good fit for the data, we need to check the residuals for normality. Here I have used two graphical checks: a normal probability plot (QQ plot) and a residual scatterplot.
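A minimal Python sketch of how these two diagnostic plots can be produced, assuming the fitted line from part iv; the arrays `x` and `y` are hypothetical stand-ins for the year 2 and final marks.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical marks for illustration only
rng = np.random.default_rng(4)
x = rng.uniform(40, 90, size=164)                            # year 2 marks
y = 5.8554 + 0.910578 * x + rng.normal(0, 3.9, size=164)     # final marks

residuals = y - (5.8554 + 0.910578 * x)                      # observed minus fitted values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(residuals, dist="norm", plot=ax1)             # QQ plot against the normal
ax1.set_title("Normal probability plot of residuals")
ax2.scatter(x, residuals)                                    # residuals vs explanatory variable
ax2.axhline(0, color="grey")
ax2.set_xlabel("Year 2 mark")
ax2.set_ylabel("Residual")
plt.show()
```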

[Figure: Normal probability plot of the residuals]

From checking the QQ plot, the residuals seem to follow a normal distribution reasonably well. If the residuals followed a normal distribution exactly, we would expect the points to lie on a straight diagonal line, and we can see that most of the data points are on or near the trendline, although the point furthest to the right deviates a little from the trendline, possibly indicating some deficiency in the bivariate linear model. Overall, however, I think the linear model is a good fit, as most of the points are close to the line.
In addition, we can plot the residuals against the explanatory variable in a scatterplot/residual plot, as shown below:

[Figure: Residual plot — residuals against the year 2 marks]

According to the scatterplot of the residuals against the explanatory variable (the year 2 scores), there does not seem to be any concrete pattern or relationship between the two: the data points are randomly scattered throughout, and there is nothing to indicate that this is not a random pattern. This supports the assumption that the residuals are independent of the explanatory variable, since no relationship between them can be deduced from the scatterplot, and the random scatter is consistent with the assumptions made about the residuals.

Overall, I would say that the bivariate linear regression model is a good fit for the data. There are a couple of discrepancies visible in the QQ plot, but I believe these are small and minor. Furthermore, the coefficient of determination calculated earlier indicated that approximately 87% of the overall variability is explained by the bivariate linear model, again suggesting a good fit. This, together with the QQ plot and the randomly scattered residual plot, leads me to conclude that the bivariate linear regression model is a sensible fit for the data.

Question 3:

Part i)
Below is a correlation coefficient matrix showing the correlation between the module marks scored by students and the final degree mark.

The numbers highlighted in yellow indicate very strong correlation, the ones in green indicate moderately strong positive correlation, the ones in blue indicate weakly moderate correlation, and the red ones indicate weak correlation.

Part ii)

To choose a suitable multiple linear regression model and determine the number of parameters to include, I have decided to use the forward selection approach. To start, I selected the covariate with the highest correlation coefficient with respect to the dependent variable, which is the final degree score. I then choose the next covariate based on which one increases the adjusted R-squared value by the largest amount, and continue in this way until the adjusted R-squared value is maximised. On this basis, the first covariate I chose is the module score for module Y2M7. This gives me the following:

By the forward selection approach we add the next covariate according to which one improves the adjusted coefficient of determination by the greatest amount. Hence the next covariate we add is Y2M6. We get the following:

As you can see, adding the Y2M6 marks improves the adjusted R-squared value. Based on this approach the next covariate we add is Y2M3. We get the following:

The next covariate we add is the Y2M2 scores:


Next we add the Y2M8 scores

Next we add the Y2M5 scores


Next we add the Y2M4 marks:

Next we add the Y2M1 marks:

Next we add the Y1M7 module scores for the students:


The next covariate we add is the scores of module Y1M8:

The final covariate we add is the scores for Y1M1:


Upon further analysis there are no more explanatory variables that can be added to this model without decreasing the adjusted coefficient of determination; hence the multiple linear model obtained here maximises the adjusted R-squared value at 0.87336 and is optimised. Therefore the multiple linear regression model obtained by the forward selection approach is:

$$y = 7.586188 + 0.047943x_1 - 0.07905x_2 + 0.02527x_3 + 0.082891x_4 + 0.0971865x_5 + 0.09694x_6 + 0.064159x_7 + 0.086145x_8 + 0.16545x_9 + 0.162355x_{10} + 0.129127x_{11}$$

where $x_1, x_2, \ldots, x_{11}$ are the covariates/explanatory variables, i.e. the scores for the various modules the students took, and 7.586188 is the intercept term.
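A minimal Python sketch of this forward selection procedure, choosing at each step the covariate that most improves the adjusted R-squared and stopping when no covariate improves it; the design matrix `modules` and response `final` are hypothetical stand-ins for the module and degree marks.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least-squares fit of y on X (with an intercept)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(X, y):
    """Greedily add the covariate that most improves adjusted R^2; stop when none does."""
    remaining, chosen, best = list(range(X.shape[1])), [], -np.inf
    while remaining:
        scores = {j: adjusted_r2(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break                                  # no remaining covariate improves the fit
        best, chosen = scores[j_best], chosen + [j_best]
        remaining.remove(j_best)
    return chosen, best

# Hypothetical data: 164 students, 16 module marks
rng = np.random.default_rng(5)
modules = rng.uniform(40, 90, size=(164, 16))
final = 7.6 + modules[:, :11] @ rng.uniform(0.02, 0.17, size=11) + rng.normal(0, 4, size=164)
print(forward_select(modules, final))
```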

Part iii)

To assess the adequacy of the multiple linear regression model I decided to use the ANOVA (F) test; the ANOVA analysis is set out below.

Comparing the F-test p-value with the 5% significance level I have chosen (as this is the most commonly used level), the p-value from this ANOVA test is significantly less than 5%. We therefore reject the null hypothesis, so the sampled data provide sufficient evidence to conclude that the multiple linear regression model fits the data better than the model with no independent variables.

Part iv)

Please see “file 4”.

To see whether the multiple linear regression model is a good fit for the data, we need to check the residuals and test them against the normality assumption. As with the bivariate linear model, the residuals are calculated as the difference between the observed values of the dependent variable and the values predicted by the multiple linear regression model.

Below is the normal probability plot for the residual scores


[Figure: Normal probability plot of the residuals from the multiple linear regression model]

The results from the normal probability plot are quite promising: the residuals seem to follow a normal distribution very well, as most of the data points are on or near the straight trendline. The point furthest to the left deviates from the trendline relative to the other points, so we may have a possible outlier. Overall, however, the linear model seems a very sensible fit for the data.

[Figure: Dot plot of the standardised residuals]
The dot plot of the standardised residuals shows a random pattern, with the residuals dispersed around the horizontal axis from approximately −3.5 to 3, suggesting that the multiple linear regression model is a good fit for the data and that the residuals follow a normal distribution.

[Figure: Histogram of the standardised residuals (frequency against residual bins from −3 to 3)]

Also, the histogram of the standardised residuals appears to follow a normal distribution very well; it is almost perfectly symmetric, similar to the standard normal distribution. There is no heavy skew, no obvious outlier and no heavy tail to suggest that the residuals do not follow a normal distribution. Hence, from performing this check we can assert that the residuals fit the normality assumption very well, and therefore the multiple regression model is a sensible fit for the data.

Part v)

Overall, both the bivariate and the multiple linear regression models seem to fit the data well, although I believe there are some potential deficiencies in both models, shown in their respective normal probability plots; these seem to be minor. The normal probability plot for the multiple linear regression model has more points close to the trendline than the plot for the bivariate model, potentially indicating greater normality in the residuals of the multiple linear regression model. Furthermore, the adjusted coefficient of determination for the multiple linear model is 0.87336, slightly greater than the adjusted value for the bivariate model of 0.867127. In addition, the histogram I plotted strongly supports the residuals of the multiple linear regression model following a normal distribution. Taking all of this into account leads me to conclude that the multiple linear regression model is a more suitable fit for the data than the simple bivariate linear regression model.

Question 4)

Generalised linear models (GLMs) are essentially an extension of the basic linear model with which you are already familiar. Like the simple linear regression model, a generalised linear model looks at the effect that the covariates/explanatory variables have on the behaviour of the dependent variable (the variable we are interested in). However, as you will recall, one of the main assumptions of the simple bivariate linear model and the multiple linear regression model was that the response variable Y follows a normal distribution, with $Y_i \sim N(\mu_i, \sigma^2)$. With the generalised linear model there is far more flexibility, in the sense that $Y_i$ can take a range of distributional forms from the exponential family, such as the binomial, normal, Poisson, gamma, exponential and chi-squared distributions.

Clearly, using a generalised linear model is more advantageous than the basic linear model in several real-life situations. For example, if an insurance company is interested in how the size of car insurance claims is affected by factors such as the age and gender of the policyholder, years of driving experience, make/model of vehicle and credit history, then assuming a normal distribution for the size of car insurance claims would not be appropriate and would likely be very problematic, since claim sizes can never be negative; a basic linear model under the normal assumption may not generate reliable or meaningful results.

More importantly, there are three main components you need to specify when constructing a generalised linear model:

• A particular distribution for the response/dependent variable
• A linear predictor, denoted by η
• A link function (for example, the logit function)

The linear predictor is a function of the covariates; in the simple bivariate case the linear predictor would be α + βx. The linear predictor effectively captures the relationship between the response variable and the covariates in the linear model. Moreover, when using GLMs we can extend the model to include interaction effects between covariates, known as interaction terms, as well as main effects acting in isolation. For example, if we were using a GLM to look at the level of deaths related to cardiovascular disease and included the covariates smoking, inactivity and weight, we would probably want an interaction term between inactivity and weight, as these covariates are likely to be strongly dependent on each other. A possible linear predictor for this model might look like this:

$$\eta = \alpha_i + \beta x_1 + \gamma x_2 + \delta x_1 x_2$$

where $\alpha_i$ is the categorical variable for smoking (with $i = 1$ if a person smokes and $i = 0$ if they do not), $x_1$ is the variable measuring the level of inactivity, $x_2$ is the variable measuring weight, $x_1 x_2$ is the interaction term combining weight and inactivity, and $\beta$, $\gamma$, $\delta$ are the respective coefficients of the covariates and the interaction term.

Finally, the link function links the linear predictor to the mean response μ, where the link function is denoted g(μ) and μ = E(Y). For the simple linear model the link is the identity, so the linear predictor and the mean response are equal, but in GLMs this need not be the case. For a link function to be valid it must be invertible (i.e. its inverse must exist) and it must be differentiable. Furthermore, each distribution has a natural link function, also known as the canonical link function. These are set out below.

Probability distribution    Canonical link function
Normal distribution         g(μ) = μ
Gamma distribution          g(μ) = 1/μ
Poisson distribution        g(μ) = log(μ)
Binomial distribution       g(μ) = log(μ/(1−μ))
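To make these components concrete, here is a small Python sketch of the canonical link functions above, together with a linear predictor that includes an interaction term being mapped back to a mean response through the inverse log link. All of the covariate values and coefficients are illustrative assumptions, not fitted values.

```python
import numpy as np

# Canonical link functions g(mu) from the table above
links = {
    "normal":   lambda mu: mu,                     # identity link
    "gamma":    lambda mu: 1.0 / mu,               # inverse link
    "poisson":  lambda mu: np.log(mu),             # log link
    "binomial": lambda mu: np.log(mu / (1 - mu)),  # logit link
}

# Hypothetical covariates: smoking indicator, inactivity score and weight (kg)
smokes, inactivity, weight = 1, 4.0, 82.0
alpha = {0: -6.0, 1: -5.5}                 # hypothetical intercepts by smoking status
beta, gamma, delta = 0.10, 0.01, 0.002     # hypothetical coefficients

# Linear predictor with main effects plus an inactivity-by-weight interaction term
eta = alpha[smokes] + beta * inactivity + gamma * weight + delta * inactivity * weight

# With a Poisson response and the log link, the mean response is the inverse link of eta
mu = np.exp(eta)
print(eta, mu, links["poisson"](mu))       # applying the link to mu recovers eta
```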

