CDA_Assignment4

PM-ASDS21 Categorical Data Analysis ASSIGNMENT 4

Prepared for: Md. Rezaul Karim, Associate Professor, Department of Statistics, Jahangirnagar University.

Prepared by: Muhammad Imran Khan, Id. 201900303004, Section B, Batch 3, PM-ASDS Program.

KEYWORDS: GLM, Maximum Likelihood estimate; 9 FEBRUARY 2022

Solution.

(a) The given linear probability model is π̂ = 0.7578 − 0.0695x, where x = decade.

Interpretation of 0.7578: When x = 0, that is, for decade 0 (1890 - 1899), the probability that the starting pitcher pitched a
complete game is 0.7578. In other words, in 75.78% of games in decade 0, the starting pitcher pitched a complete game.

Interpretation of −0.0695: The negative sign indicates that the probability of the event of interest decreases every
decade. The value 0.0695 tells us that for every unit increase in x, i.e. for each subsequent decade, the probability
decreases by 0.0695. That is, the probability of a starting pitcher pitching a complete game falls by 6.95 percentage
points from the previous decade.

(b) π̂(10) = 0.7578 − 0.0695 × 10 = 0.0628

π̂(11) = 0.7578 − 0.0695 × 11 = −0.0067

π̂(12) = 0.7578 − 0.0695 × 12 = −0.0762
A probability can never be negative or greater than 1. So not only are these results implausible, the model itself
becomes invalid beyond a certain range of x. In fact, this is one of the structural defects of the linear probability
model. Any value of x larger than 10 yields an even larger negative value, and going backwards, any x smaller than −3
gives a value larger than 1. So, the model is valid only within the range x = [−3, 10].
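The breakdown at the edges can be checked numerically. The assignment's analysis is in R; the short Python sketch below is only an arithmetic cross-check that evaluates the fitted linear probability model over several decades and flags predictions falling outside [0, 1]:

```python
def pi_hat(x):
    """Fitted linear probability model from part (a): pi = 0.7578 - 0.0695*x."""
    return 0.7578 - 0.0695 * x

# Probe the decades around the edges of validity.
for decade in (-4, -3, 9, 10, 11, 12):
    p = pi_hat(decade)
    flag = "" if 0.0 <= p <= 1.0 else "  <-- outside [0, 1]"
    print(f"decade {decade:3d}: pi_hat = {p:7.4f}{flag}")
```

Running this shows predictions leaving [0, 1] at both ends, which is exactly the structural defect discussed above.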

Furthermore, all linear probability models exhibit heteroskedasticity. Because the actual values of yi are all either 0
or 1, while the predicted values are probabilities anywhere between 0 and 1 (and sometimes even outside that range), the
size of the residuals grows or shrinks as the predicted values grow or shrink. [1] This also undermines the plausibility
of such models.

(c) The given ML fit of the logistic regression model is π̂ = exp(1.148 − 0.315x) / (1 + exp(1.148 − 0.315x))

π̂(10) = exp(1.148 − 0.315 × 10) / (1 + exp(1.148 − 0.315 × 10)) = exp(−2.002) / (1 + exp(−2.002)) = 0.1190

π̂(11) = exp(1.148 − 0.315 × 11) / (1 + exp(1.148 − 0.315 × 11)) = exp(−2.317) / (1 + exp(−2.317)) = 0.0897

π̂(12) = exp(1.148 − 0.315 × 12) / (1 + exp(1.148 − 0.315 × 12)) = exp(−2.632) / (1 + exp(−2.632)) = 0.0671

Certainly, these are more plausible than those of the previous model, as the probabilities always remain between 0 and
1, so this model is free from the major structural defect of the linear probability model. But this model is still
monotonically decreasing, which means that as we move forward through the decades, the probability of the event of
interest will always decrease. This, too, is implausible.
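The three logistic predictions above can be reproduced with a quick Python sketch (the original work is in R; this is just an arithmetic cross-check of the fitted formula):

```python
import math

def pi_hat_logit(x):
    """Fitted logistic model from part (c): logit(pi) = 1.148 - 0.315*x."""
    eta = 1.148 - 0.315 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

for decade in (10, 11, 12):
    print(f"decade {decade}: pi_hat = {pi_hat_logit(decade):.4f}")
```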

Data preparation

data = read.table("C:/Users/HP/Desktop/CDA/Assign4/crab.txt",
                  col.names = c('serial', 'color', 'spine',
                                'width', 'satellite', 'weight'))

This loads the data into a data frame named data. Since the data file we downloaded had no column names, we supply them
using the col.names argument of the read.table() function.

> head(data)
  serial color spine width satellite weight
1      1     3     3  28.3         8   3.05
2      2     4     3  22.5         0   1.55
3      3     2     1  26.0         9   2.30
4      4     4     3  24.8         0   2.10
5      5     4     3  26.0         4   2.60
6      6     3     3  23.8         0   2.10

A peek at the first few observations shows that the serial column is not needed, so we drop it and add a binary column:
y = 0 if satellite = 0, and y = 1 otherwise.

data[,1] = NULL
data = transform(data, y = ifelse(satellite > 0, 1, 0))

> head(data)
  color spine width satellite weight y
1     3     3  28.3         8   3.05 1
2     4     3  22.5         0   1.55 0
3     2     1  26.0         9   2.30 1
4     4     3  24.8         0   2.10 0
5     4     3  26.0         4   2.60 1
6     3     3  23.8         0   2.10 0

With this data, we are ready to answer the questions.

(a) Linear probability model with OLS (Ordinary Least Squares) parameter estimation

LinearProbModelOLS = lm(y ~ weight, data = data)
summary(LinearProbModelOLS)

> summary(LinearProbModelOLS)

Call:
lm(formula = y ~ weight, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8878 -0.4683  0.1606  0.3704  0.6689

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.14487    0.14715  -0.984    0.326
weight       0.32270    0.05876   5.492 1.42e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4447 on 171 degrees of freedom
Multiple R-squared:  0.1499, Adjusted R-squared:  0.1449
F-statistic: 30.16 on 1 and 171 DF,  p-value: 1.421e-07
Interpretation of the parameter estimate: The fitted OLS model is π̂ = −0.14487 + 0.3227 × weight; the slope estimate is
0.3227. This implies that for every unit increase in weight, the probability of having at least one satellite (i.e. Y = 1)
increases by 0.3227. The positive sign indicates that as weight increases, this probability increases as well.
> predict(LinearProbModelOLS, newdata = data.frame("weight" = c(5.2)), type = "response")
       1
1.533186

Given the model π̂ = −0.14487 + 0.3227 × weight, π̂(5.2) = −0.14487 + 0.3227 × 5.2 = 1.53317, which is approximately what
we got using the predict() function.

Comment: The result is invalid, as a probability cannot exceed 1. This, as we discussed before, is a major structural
defect of linear probability models.
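As a cross-check of the R output, the hand prediction can be confirmed in a few lines of Python (the coefficients are taken from the summary output above; this sketch is not part of the original R session):

```python
intercept, slope = -0.14487, 0.32270  # OLS estimates from the summary output

def predict_lpm(weight):
    """Fitted linear probability model: pi = intercept + slope * weight."""
    return intercept + slope * weight

p = predict_lpm(5.2)
print(f"predicted 'probability' at weight 5.2: {p:.5f}")  # exceeds 1, hence invalid
```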

(b) Linear probability model with ML (Maximum Likelihood) parameter estimation

LinearProbModelML = glm(y ~ weight, family = gaussian(link = "identity"), data = data)
summary(LinearProbModelML)
predict(LinearProbModelML, newdata = data.frame("weight" = c(5.2)), type = "response")

> summary(LinearProbModelML)

Call:
glm(formula = y ~ weight, family = gaussian(link = "identity"),
    data = data)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.8878 -0.4683  0.1606  0.3704  0.6689

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.14487    0.14715  -0.984    0.326
weight       0.32270    0.05876   5.492 1.42e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1977574)

    Null deviance: 39.780 on 172 degrees of freedom
Residual deviance: 33.817 on 171 degrees of freedom
AIC: 214.56

Number of Fisher Scoring iterations: 2


Interpretation of the parameter estimate: The fitted ML model is π̂ = −0.14487 + 0.3227 × weight; the slope estimate is
0.3227. This implies that for every unit increase in weight, the probability of having at least one satellite (i.e. Y = 1)
increases by 0.3227. The positive sign indicates that as weight increases, this probability increases as well.
> predict(LinearProbModelML, newdata = data.frame("weight" = c(5.2)), type = "response")
       1
1.533186

We got the same model as with OLS, so the predicted probability is still greater than 1.

Comment: The result is invalid, as a probability cannot exceed 1. This, as we discussed before, is a major structural
defect of linear probability models.

(c) Logistic Regression model

LogisticRegModel = glm(y ~ weight, family = binomial(link = "logit"), data = data)
summary(LogisticRegModel)

> summary(LogisticRegModel)

Call:
glm(formula = y ~ weight, family = binomial(link = "logit"),
    data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1108  -1.0749   0.5426   0.9122   1.6285

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.6947     0.8802  -4.198 2.70e-05 ***
weight        1.8151     0.3767   4.819 1.45e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 195.74 on 171 degrees of freedom
AIC: 199.74

Number of Fisher Scoring iterations: 4


> predict(LogisticRegModel, newdata = data.frame("weight" = c(5.2)), type = "response")
        1
0.9968084

For the logistic regression model above, we get
π̂(5.2) = exp(−3.6947 + 1.8151 × 5.2) / (1 + exp(−3.6947 + 1.8151 × 5.2)) = exp(5.7438) / (1 + exp(5.7438)) = 0.9968

So, the model above predicts a 99.68% probability of having at least one satellite for a crab having weight = 5.2 kg.
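The logistic prediction can be reproduced with a short Python sketch (coefficients from the R summary above; purely an arithmetic cross-check):

```python
import math

b0, b1 = -3.6947, 1.8151  # ML estimates from the logistic summary output

def predict_logistic(weight):
    """Inverse logit of the linear predictor b0 + b1*weight."""
    eta = b0 + b1 * weight
    return 1.0 / (1.0 + math.exp(-eta))

print(f"predicted probability at weight 5.2: {predict_logistic(5.2):.7f}")
```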

(d) Probit model

ProbitModel = glm(y ~ weight, family = binomial(link = "probit"), data = data)
summary(ProbitModel)

> summary(ProbitModel)

Call:
glm(formula = y ~ weight, family = binomial(link = "probit"),
    data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1436  -1.0774   0.5336   0.9196   1.6216

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2383     0.5116  -4.375 1.22e-05 ***
weight        1.0990     0.2151   5.108 3.25e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 195.46 on 171 degrees of freedom
AIC: 199.46

Number of Fisher Scoring iterations: 4

> predict(ProbitModel, newdata = data.frame("weight" = c(5.2)), type = "response")
        1
0.9997462

The probit model above predicts a 99.97% probability of having at least one satellite for a crab having weight = 5.2 kg.
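A probit prediction is simply the standard normal CDF evaluated at the linear predictor. A small Python sketch (coefficients from the summary above; an arithmetic cross-check, not the original R call):

```python
from statistics import NormalDist

b0, b1 = -2.2383, 1.0990  # probit coefficients from the summary output

def predict_probit(weight):
    """Probit model: pi = Phi(b0 + b1 * weight), Phi = standard normal CDF."""
    return NormalDist().cdf(b0 + b1 * weight)

print(f"predicted probability at weight 5.2: {predict_probit(5.2):.7f}")
```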

In the dataset we prepared for the previous problem, the satellite column is what we have as Y in this question. So, we
will use the satellite column as the response variable here.

(a) The given model is log μ(satellite) = −0.4284 + 0.5893 × weight, so for weight = 2.44 kg, the model gives

log μ(satellite) = −0.4284 + 0.5893 × 2.44 = 1.0095

So, μ(satellite) = e^1.0095 = 2.744 ≈ 3

For female crabs with average weight 2.44 kg, we would expect about 3 satellites on average.
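The expected count can be checked with one line of Python (coefficients from the given loglinear model; a sketch, not the assignment's R code):

```python
import math

def mu_satellites(weight):
    """Fitted Poisson loglinear model: log(mu) = -0.4284 + 0.5893 * weight."""
    return math.exp(-0.4284 + 0.5893 * weight)

print(f"expected satellites at weight 2.44 kg: {mu_satellites(2.44):.3f}")
```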

(b)

Solution.

Assuming the variable notation Asthma = A, Smoking = S, and Gender = G.

(a) Odds ratio for boys whose mothers smoke ≥ ½ pack every day: θ_AS(G=Boys) = (17 × 274) / (63 × 41) = 1.8

(b) Odds ratio for girls whose mothers smoke ≥ ½ pack every day: θ_AS(G=Girls) = (8 × 261) / (20 × 55) = 1.898 ≈ 1.9

(c) The partial odds ratios are almost equal, so homogeneity of association is plausible. Below is a test for it.

H0: the association is homogeneous, θ_AS(G=Boys) = θ_AS(G=Girls)

H1: the association is not homogeneous, θ_AS(G=Boys) ≠ θ_AS(G=Girls)

We calculate the Breslow-Day statistic, Σi Σj Σk (Oijk − Eijk)² / Vark, and compare it against the chi-square critical
value χ²(df = K − 1, α).

Below are the steps of the Breslow-Day test of homogeneous association.

Step 1. Calculate the adjusted Mantel-Haenszel odds ratio

The expected frequency calculation for Breslow-Day uses the adjusted Mantel-Haenszel odds ratio statistic. The
Mantel-Haenszel odds ratio from the given partial tables is

θ̂ = [(17 × 274)/395 + (8 × 261)/344] / [(63 × 41)/395 + (20 × 55)/344] = (11.7924 + 6.0698) / (6.5392 + 3.1977)
  = 17.8622 / 9.7369 = 1.83

Now, we saw that our partial tables have odds ratios 1.8 and 1.9. For the null hypothesis of homogeneity to hold
exactly, we adjust the values of our partial tables so that both partial tables have odds ratio = 1.83.

Step 2. Find the adjusted frequency of the 1st cell in each partial table

We adjust the first-cell (n11k) value of each partial table keeping the marginal totals the same, so the rest of the
cells are auto-adjusted to match the marginal totals. We adjust them so that each table has odds ratio equal to the
adjusted Mantel-Haenszel odds ratio 1.83.

For the G = Boys table, let A be the adjusted value of the first cell. With the margins fixed, the other cells become
80 − A, 58 − A, and 257 + A, so

1.83 = A × (257 + A) / [(80 − A) × (58 − A)]

⇒ 0.83A² − 509.54A + 8491.2 = 0

Solving the quadratic equation with a statistical calculator (like a Casio fx-991EX) gives us

A = 596.76 or A = 17.143

A cannot be 596.76, because that exceeds the row, column, and sample totals of the first partial table. So the adjusted
A for the first partial table is 17.143.

For G = Girls, with the margins fixed the other cells become 63 − A, 28 − A, and 253 + A, so

1.83 = A × (253 + A) / [(63 − A) × (28 − A)] = (253A + A²) / (1764 − 91A + A²)

⇒ 0.83A² − 419.53A + 3228.12 = 0

This gives us A = 497.64 or A = 7.82.

So we take A = 7.82 here, as the other root is too large to fit into our second partial table.
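The two quadratics can also be solved programmatically. This Python sketch derives the coefficients directly from the table margins and the target odds ratio (the margins are from the problem; the root-selection rule mirrors the argument above):

```python
import math

def adjusted_first_cell(or_target, row1, col1, n):
    """Solve a*(n - row1 - col1 + a) = or_target*(row1 - a)*(col1 - a)
    for the first-cell count a, with all margins held fixed."""
    A = or_target - 1.0
    B = -(or_target * (row1 + col1) + (n - row1 - col1))
    C = or_target * row1 * col1
    disc = math.sqrt(B * B - 4 * A * C)
    roots = [(-B - disc) / (2 * A), (-B + disc) / (2 * A)]
    # only the root that fits inside the table is admissible
    return next(r for r in roots if 0 < r < min(row1, col1))

a_boys  = adjusted_first_cell(1.83, row1=80, col1=58, n=395)
a_girls = adjusted_first_cell(1.83, row1=63, col1=28, n=344)
print(f"adjusted first cells: boys {a_boys:.3f}, girls {a_girls:.3f}")
```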

Step 3. Adjust all the values and verify that the odds ratio for the adjusted values is 1.83

Now we put the adjusted value of the first cell within parentheses, and reset the other cell values to match the fixed
row and column totals.

G = Boys
            A = Yes        A = No           Total
S ≥ ½       17 (17.143)    63 (62.857)      80
S < ½       41 (40.857)    274 (274.143)    315
Total       58             337              395

Odds ratio: θ = (17.143 × 274.143) / (62.857 × 40.857) = 1.8299 ≈ 1.83

G = Girls
            A = Yes       A = No           Total
S ≥ ½       8 (7.82)      55 (55.18)       63
S < ½       20 (20.18)    261 (260.82)     281
Total       28            316              344

Odds ratio: θ = (7.82 × 260.82) / (55.18 × 20.18) = 1.832 ≈ 1.83

So, with our adjusted values, both partial odds ratios are equal to the adjusted Mantel-Haenszel odds ratio.

Step 4. Calculate variances and the Breslow-Day statistic

Variance (partial table 1) = (1/17.143 + 1/62.857 + 1/40.857 + 1/274.143)⁻¹ = 0.1024⁻¹ = 9.769

Variance (partial table 2) = (1/7.82 + 1/55.18 + 1/20.18 + 1/260.82)⁻¹ = 0.1994⁻¹ = 5.015

χ²(Breslow-Day, 1) = (17 − 17.143)²/9.769 + (63 − 62.857)²/9.769 + (41 − 40.857)²/9.769 + (274 − 274.143)²/9.769
                   = 0.0021 + 0.0021 + 0.0021 + 0.0021 = 0.0084

χ²(Breslow-Day, 2) = (8 − 7.82)²/5.015 + (55 − 55.18)²/5.015 + (20 − 20.18)²/5.015 + (261 − 260.82)²/5.015
                   = 0.0065 + 0.0065 + 0.0065 + 0.0065 = 0.026

So, Breslow-Day statistic = 0.0084 + 0.026 = 0.0344
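Steps 2-4 can be cross-checked with a short Python sketch. The adjusted cell counts below are the values derived above, and each table's contribution follows the same (O − E)²/Var form used in the hand calculation (small rounding differences are expected):

```python
# Observed and Breslow-Day-adjusted cell counts (n11, n12, n21, n22) per table
boys_obs,  boys_adj  = [17, 63, 41, 274], [17.143, 62.857, 40.857, 274.143]
girls_obs, girls_adj = [8, 55, 20, 261], [7.82, 55.18, 20.18, 260.82]

def table_variance(adj):
    """Inverse of the sum of reciprocal adjusted cell counts."""
    return 1.0 / sum(1.0 / c for c in adj)

def bd_component(obs, adj):
    """Sum of (O - E)^2 / Var over the four cells of one partial table."""
    v = table_variance(adj)
    return sum((o - a) ** 2 / v for o, a in zip(obs, adj))

bd = bd_component(boys_obs, boys_adj) + bd_component(girls_obs, girls_adj)
print(f"Breslow-Day statistic: {bd:.4f}  (critical value 3.84 at alpha = 0.05)")
```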

9
Given K=2, for 95% confidence, ꭓ df =K−1 , α = ꭓ 1 ,0.05=3.84
2 2

Since our calculated statistic is less than ꭓ 1 ,0.05 , we accept the null hypothesis, i.e., we have homogeneous association in
2

the given partial tables.

We know that Mantel-Haenszel common odds ratio is a useful statistic when we have homogeneous association in the
partial tables. Since we do have it here, it is quite appropriate to calculate a common odds ratio.

(d)

The Mantel-Haenszel common odds ratio is

θ̂_MH = [(17 × 274)/395 + (8 × 261)/344] / [(63 × 41)/395 + (20 × 55)/344] = (11.7924 + 6.0698) / (6.5392 + 3.1977)
     = 17.8622 / 9.7369 = 1.834

A 95% confidence interval for this common odds ratio is [1.102, 3.053], with a p-value less than significance level α =
0.05. So, the result is significant.

Manual calculation

We form marginal table for the given data.

            A = Yes    A = No    Total
S ≥ ½       25         118       143
S < ½       61         535       596
Total       86         653       739

log(θ̂_MH) = ln(1.834) = 0.6065

Variance = 1/25 + 1/118 + 1/61 + 1/535 = 0.066737

Standard error = √0.066737 = 0.2583

For 95% confidence, margin of error = 1.96 × 0.2583 = 0.5063

So, the 95% confidence interval for the log odds ratio = 0.6065 ± 0.5063 = [0.1002, 1.1128]

The 95% confidence interval for the odds ratio = [e^0.1002, e^1.1128] = [1.105, 3.043]

This is almost identical to what we got from the StatXact 12 application (shown above).
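The manual confidence-interval calculation can be reproduced step by step in Python (the odds ratio and marginal counts come from the calculation above; this is a sketch of the same Wald-type interval, not StatXact output):

```python
import math

or_mh = 1.834                 # Mantel-Haenszel common odds ratio from (d)
counts = [25, 118, 61, 535]   # cells of the marginal table above

log_or = math.log(or_mh)
se = math.sqrt(sum(1.0 / c for c in counts))  # SE of the log odds ratio
lo = math.exp(log_or - 1.96 * se)
hi = math.exp(log_or + 1.96 * se)
print(f"95% CI for the common odds ratio: [{lo:.3f}, {hi:.3f}]")
```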

(e)

Although we have homogeneous association between the odds ratios for boys and girls, that is,
θ_AS(G=Boys) = θ_AS(G=Girls), they both are NOT necessarily equal to 1.

H0: the data are conditionally independent, θ_AS(G=Boys) = θ_AS(G=Girls) = 1

H1: the data are not conditionally independent, θ_AS(G=Boys) ≠ 1 and/or θ_AS(G=Girls) ≠ 1

We can test conditional independence via the Cochran-Mantel-Haenszel (CMH) test below.

For each partial table, we compute the odds ratio, the expected value μ11k of n11k, and the variance Var(n11k):

G = Boys
            Asthma = Yes    Asthma = No    Total
S ≥ ½       17              63             80
S < ½       41              274            315
Total       58              337            395

Odds ratio = (17 × 274)/(63 × 41) = 1.8
μ11,1 = (80 × 58)/395 = 11.75
Var(n11,1) = (80 × 315 × 58 × 337)/(395² × 394) = 8.01

G = Girls
            Asthma = Yes    Asthma = No    Total
S ≥ ½       8               55             63
S < ½       20              261            281
Total       28              316            344

Odds ratio = (8 × 261)/(20 × 55) = 1.9
μ11,2 = (63 × 28)/344 = 5.13
Var(n11,2) = (63 × 281 × 28 × 316)/(344² × 343) = 3.86

Σ n11k = 17 + 8 = 25,  Σ μ11k = 11.75 + 5.13 = 16.88,  Σ Var(n11k) = 8.01 + 3.86 = 11.87

CMH statistic = (25 − 16.88)² / 11.87 = 65.9344 / 11.87 = 5.555

Since the CMH statistic follows a chi-square distribution, we check our computed statistic against the critical value
from the chi-square table.

At 95% confidence, α = 0.05, and degrees of freedom = K − 1 = 2 − 1 = 1 (K = number of partial tables).

χ²(1, 0.05) = 3.841

CMH > χ²(1, 0.05), and so we reject the null hypothesis. In other words, our data are NOT conditionally independent.

References

[1] Linear Probability Model, https://murraylax.org/rtutorials/linearprob.html
