CDA_Assignment4
Solution.
Interpretation of 0.7578: When x = 0, that is, for decade 0 (1890 - 1899), the estimated probability that the starting pitcher pitched a
complete game is 0.7578. In other words, the model estimates that the starting pitcher pitched a complete game in 75.78% of the games in decade 0.
Interpretation of −0.0695: The negative sign indicates that the probability of the event of interest decreases from one decade
to the next. The value 0.0695 tells us that for every unit increase in x, i.e. for each successive decade, the estimated probability
of a starting pitcher pitching a complete game decreases by 0.0695, that is, by 6.95 percentage points relative to the previous decade.
(b) $\hat{\pi}(10) = 0.7578 - 0.0695 \times 10 = 0.0628$
$\hat{\pi}(11) = 0.7578 - 0.0695 \times 11 = -0.0067$
$\hat{\pi}(12) = 0.7578 - 0.0695 \times 12 = -0.0762$
A probability can never be negative or greater than 1. So these results are not merely implausible; the model itself is
invalid outside a certain range of x. In fact, this is one of the structural defects of the linear probability model. Any value
of x larger than 10 gives a negative fitted value, and if we go backwards, any x smaller than -3 gives a
fitted value larger than 1. So, for integer x, the model is valid only within the range x = [-3, 10].
Furthermore, all linear probability models have heteroskedasticity. Because all of the actual values of y_i are either
0 or 1, while the predicted values are probabilities anywhere between 0 and 1 (and sometimes even outside that range),
the size of the residuals grows or shrinks as the predicted values grow or shrink. [1] This also undermines the
plausibility of such models.
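The predictions and the valid range discussed above can be checked with a short R sketch. The helper name lpm_predict and the search grid -10:20 are our own choices for illustration, not part of the assignment.

```r
# Coefficients of the linear probability model quoted in the text
intercept <- 0.7578
slope <- -0.0695

# Fitted "probability" for a vector of decades x (hypothetical helper)
lpm_predict <- function(x) intercept + slope * x

# Predictions for decades 10, 11 and 12
pred_future <- lpm_predict(c(10, 11, 12))
print(round(pred_future, 4))

# Integer decades (on an arbitrary grid) whose fitted value stays inside [0, 1]
x_grid <- -10:20
fitted_grid <- lpm_predict(x_grid)
valid_x <- x_grid[fitted_grid >= 0 & fitted_grid <= 1]
print(range(valid_x))
```

The second print shows the smallest and largest integer decade for which the fitted value is still a legitimate probability.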
(c) The given ML fit with logistic regression is $\hat{\pi} = \frac{\exp(1.148 - 0.315x)}{1 + \exp(1.148 - 0.315x)}$
Certainly, these predictions are more plausible than those of the previous model, as the fitted probabilities always remain between 0 and 1. So, the model is
free from the major structural defect of the linear probability model. However, the fitted curve is still monotonically decreasing,
which means that as we move forward through the decades, the probability of the event of interest always decreases toward 0. That is
also implausible over a long horizon.
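As a quick check of the two properties claimed for the logistic fit (values stay inside (0, 1) and decrease with x), here is a minimal R sketch. It uses the base R function plogis(), which evaluates exp(z) / (1 + exp(z)); the helper name logit_predict and the chosen decades are assumptions for illustration.

```r
# Logistic fit quoted in the text: logit(pi) = 1.148 - 0.315 * x
logit_predict <- function(x) plogis(1.148 - 0.315 * x)

# Fitted probabilities for decade 0 and for decades 10, 11, 12
x <- c(0, 10, 11, 12)
p <- logit_predict(x)
print(round(p, 4))

# The two properties discussed above
print(all(p > 0 & p < 1))   # always a valid probability
print(all(diff(p) < 0))     # strictly decreasing in x
```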
Data preparation
6 6 3 3 23.8 0 2.10
The code snippet below achieves what we sought before.
# drop the first column
data[,1] = NULL
# y = 1 if the crab has at least one satellite, 0 otherwise
data = transform(data, y = ifelse(satellite>0,1,0))
A peek at the data now is presented here.
> head(data)
color spine width satellite weight y
1 3 3 28.3 8 3.05 1
2 4 3 22.5 0 1.55 0
3 2 1 26.0 9 2.30 1
4 4 3 24.8 0 2.10 0
5 4 3 26.0 4 2.60 1
6 3 3 23.8 0 2.10 0
(a) Linear probability model with OLS (Ordinary Least Squares) parameter estimation
Residuals:
Min 1Q Median 3Q Max
-0.8878 -0.4683 0.1606 0.3704 0.6689
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.14487 0.14715 -0.984 0.326
weight 0.32270 0.05876 5.492 1.42e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(b) Linear probability model with ML (Maximum Likelihood) parameter estimation
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.14487 0.14715 -0.984 0.326
weight 0.32270 0.05876 5.492 1.42e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1108 -1.0749 0.5426 0.9122 1.6285
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.6947 0.8802 -4.198 2.70e-05 ***
weight 1.8151 0.3767 4.819 1.45e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, the logistic regression model above predicts a 99.68% probability of having at least one satellite for a crab weighing 5.2 kg.
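The 99.68% figure can be reproduced from the reported coefficients alone. The sketch below assumes the coefficient table just above is a logit-link fit and uses base R's plogis() for the inverse logit; the variable names are ours.

```r
# Reported coefficients (intercept and weight) from the table above
b0 <- -3.6947
b1 <- 1.8151
weight <- 5.2   # kg

# Linear predictor and its inverse-logit transform
eta <- b0 + b1 * weight
p_logit <- plogis(eta)
print(round(p_logit, 4))
```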
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1436 -1.0774 0.5336 0.9196 1.6216
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2383 0.5116 -4.375 1.22e-05 ***
weight 1.0990 0.2151 5.108 3.25e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The probit model above predicts a 99.97% probability of having at least one satellite for a crab weighing 5.2 kg.
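The probit prediction can be checked the same way, with the standard normal CDF pnorm() as the inverse link. For contrast, the sketch also evaluates the linear probability fit from part (a) at the same weight; that comparison is our own addition, and the variable names are assumptions.

```r
weight <- 5.2   # kg

# Probit fit: Phi(-2.2383 + 1.0990 * weight)
p_probit <- pnorm(-2.2383 + 1.0990 * weight)
print(round(p_probit, 4))

# Linear probability fit from part (a), evaluated at the same weight
p_linear <- -0.14487 + 0.32270 * weight
print(round(p_linear, 4))

# A fitted value above 1 is not a valid probability
print(p_linear > 1)
```

At this weight the linear fit leaves the [0, 1] range while the probit fit cannot, which is the same structural point made for the pitching data.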
In the dataset we prepared for the previous problem, the satellite column is what we have as Y in this question. So, we
will use the satellite column as the response variable here.
(a) The given model is $\log \hat{\mu}(\text{satellite}) = -0.4284 + 0.5893 \times \text{weight}$, so for weight = 2.44 kg the model gives
$\hat{\mu} = \exp(-0.4284 + 0.5893 \times 2.44) = \exp(1.0095) \approx 2.74$
For female crabs with the average weight of 2.44 kg, we would expect about 2.74, i.e. roughly 3, satellites on average.
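The expected count for an average-weight crab follows directly from the log-link model. A minimal sketch, with variable names of our own choosing:

```r
# Poisson log-linear model quoted in the text: log(mu) = -0.4284 + 0.5893 * weight
weight_avg <- 2.44   # kg, average weight used in the text

# Invert the log link to get the expected number of satellites
mu_hat <- exp(-0.4284 + 0.5893 * weight_avg)
print(round(mu_hat, 2))
```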
(b)
Solution.
(a) Odds ratio for boys with mothers smoking ≥ ½ pack every day, $\theta_{AS(G=\text{Boys})} = \frac{17 \times 274}{63 \times 41} = 1.8$
(b) Odds ratio for girls with mothers smoking ≥ ½ pack every day, $\theta_{AS(G=\text{Girls})} = \frac{8 \times 261}{20 \times 55} = 1.898 \approx 1.9$
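The two partial odds ratios can be reproduced with a few lines of R. The matrix layout (rows S≥½ and S<½, columns asthma yes and no) and the helper name odds_ratio are assumptions made for this sketch.

```r
# Partial 2x2 tables: rows = smoking (>= 1/2 pack, < 1/2 pack), columns = asthma (yes, no)
boys  <- matrix(c(17, 63, 41, 274), nrow = 2, byrow = TRUE)
girls <- matrix(c(8, 55, 20, 261), nrow = 2, byrow = TRUE)

# Cross-product ratio of a 2x2 table (hypothetical helper)
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

or_boys <- odds_ratio(boys)
or_girls <- odds_ratio(girls)
print(round(c(boys = or_boys, girls = or_girls), 3))
```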
(c) The partial odds ratios are nearly equal, so homogeneous association is plausible. Below is a test for it.
i j k Eijk
Now, we saw that our partial tables have odds ratios 1.8 and 1.9. To be homogeneous (for the null hypothesis to be
true), we need to adjust the values of our partial tables in such a way so that both partial tables have odds ratio = 1.83
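The common value 1.83 is the Mantel-Haenszel estimate of the odds ratio. As a sketch, and assuming the usual estimator (sum of n11k·n22k/nk divided by sum of n12k·n21k/nk over the partial tables), it can be computed as follows; the helper name mh_odds_ratio is ours.

```r
# Partial 2x2 tables: rows = smoking (>= 1/2 pack, < 1/2 pack), columns = asthma (yes, no)
tables <- list(
  boys  = matrix(c(17, 63, 41, 274), nrow = 2, byrow = TRUE),
  girls = matrix(c(8, 55, 20, 261), nrow = 2, byrow = TRUE)
)

# Mantel-Haenszel common odds ratio (hypothetical helper)
mh_odds_ratio <- function(tabs) {
  num <- sum(sapply(tabs, function(t) t[1, 1] * t[2, 2] / sum(t)))
  den <- sum(sapply(tabs, function(t) t[1, 2] * t[2, 1] / sum(t)))
  num / den
}

mh_or <- mh_odds_ratio(tables)
print(round(mh_or, 2))
```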
Step 2. Find the adjusted frequency of 1st cell in each partial table
We will adjust the first cell (n11k) value of each partial table while keeping the marginal totals fixed, so the rest of
the cells are determined automatically by the marginal totals. We adjust them so that each table has an odds ratio equal to the
adjusted Mantel-Haenszel odds ratio 1.83
For G = Boys table, assume A is the adjusted value of the first field.
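$1.83 = \frac{A \times (315 - 58 + A)}{(80 - A) \times (58 - A)} = \frac{257A + A^2}{4640 - 138A + A^2}$
$\Rightarrow 0.83A^2 - 509.54A + 8491.2 = 0$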
Solving the quadratic equation with a statistical calculator (like Casio fx-991EX) gives us,
A = 596.76 or A = 17.143
A cannot be 596.76, because that exceeds the total of row/column/sample size for the first partial table. So, adjusted A
for the first partial table is 17.143
For G = Girls,
$1.83 = \frac{A \times (281 - 28 + A)}{(63 - A) \times (28 - A)} = \frac{253A + A^2}{1764 - 91A + A^2}$
$\Rightarrow 0.83A^2 - 419.53A + 3228.12 = 0$
This gives us A = 497.64 or A = 7.82
So, we get A = 7.82 here as the other value is too large to fit into our second partial table.
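Instead of a calculator, the quadratic for the adjusted first cell can be solved in R. The sketch below writes the condition A(n − r1 − c1 + A) / ((r1 − A)(c1 − A)) = ψ as a quadratic in A and keeps only the root that is a feasible cell count. The function name adjusted_n11 and its argument names are hypothetical.

```r
# Solve for the first-cell value A that gives odds ratio psi with fixed margins
# row1 = first row total, col1 = first column total, n = table total
adjusted_n11 <- function(row1, col1, n, psi) {
  # (1 - psi) A^2 + ((n - row1 - col1) + psi (row1 + col1)) A - psi row1 col1 = 0
  a <- 1 - psi
  b <- (n - row1 - col1) + psi * (row1 + col1)
  c0 <- -psi * row1 * col1
  roots <- (-b + c(-1, 1) * sqrt(b^2 - 4 * a * c0)) / (2 * a)
  # Keep the root compatible with the marginal totals
  roots[roots >= max(0, row1 + col1 - n) & roots <= min(row1, col1)]
}

a_boys  <- adjusted_n11(row1 = 80, col1 = 58, n = 395, psi = 1.83)
a_girls <- adjusted_n11(row1 = 63, col1 = 28, n = 344, psi = 1.83)
print(round(c(boys = a_boys, girls = a_girls), 3))
```

The feasibility filter plays the same role as the argument in the text that the larger root exceeds the table totals.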
Step 3. Adjust all the values as per the new values, verify if the Odds Ratio for adjusted values is 1.83
Now we put the adjusted values of the first cell within parenthesis, and reset other cell values to match the fixed row
and column totals.
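G = Boys                                   G = Girls
        A = Yes    A = No    Total                 A = Yes   A = No   Total
S≥½    (17.143)    62.857      80          S≥½     (7.82)    55.18      63
S<½     40.857    274.143     315          S<½      20.18   260.82     281
Total       58        337     395          Total       28      316     344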
So, with our adjusted values, both partial odds ratios are equal to the adjusted Mantel-Haenszel odds ratio.
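Variance (Partial table 1) $= \left(\frac{1}{17.143} + \frac{1}{62.857} + \frac{1}{40.857} + \frac{1}{274.143}\right)^{-1} = 9.769$
Variance (Partial table 2) $= \left(\frac{1}{7.82} + \frac{1}{55.18} + \frac{1}{20.18} + \frac{1}{260.82}\right)^{-1} = 0.1994^{-1} = 5.015$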
$\chi^2_{\text{Breslow-Day},1} = \frac{(17 - 17.143)^2}{9.769} + \frac{(63 - 62.857)^2}{9.769} + \frac{(41 - 40.857)^2}{9.769} + \frac{(274 - 274.143)^2}{9.769} = 0.0021 + 0.0021 + 0.0021 + 0.0021 = 0.0084$
$\chi^2_{\text{Breslow-Day},2} = \frac{(8 - 7.82)^2}{5.015} + \frac{(55 - 55.18)^2}{5.015} + \frac{(20 - 20.18)^2}{5.015} + \frac{(261 - 260.82)^2}{5.015} = 0.0065 + 0.0065 + 0.0065 + 0.0065 = 0.026$
Given K = 2, for 95% confidence, $\chi^2_{df = K-1, \alpha} = \chi^2_{1, 0.05} = 3.84$
Since our calculated statistic is less than $\chi^2_{1, 0.05}$, we fail to reject the null hypothesis, i.e., the data are consistent with homogeneous association in the partial tables.
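The homogeneity check can be sketched in R from the adjusted first-cell values quoted above (17.143 and 7.82). Note one assumption: this sketch uses the form of the Breslow-Day statistic with a single term (n11k − adjusted n11k)² / Var per partial table, since with fixed margins all four cell deviations in a table have the same magnitude. The helper name bd_term is ours.

```r
# One Breslow-Day term for a 2x2 table with fixed margins (hypothetical helper)
bd_term <- function(obs11, adj11, row1, col1, n) {
  # Adjusted cell counts implied by the fixed marginal totals
  adj <- c(adj11, row1 - adj11, col1 - adj11, n - row1 - col1 + adj11)
  v <- 1 / sum(1 / adj)              # variance of n11 under the common odds ratio
  c(variance = v, term = (obs11 - adj11)^2 / v)
}

bd_boys  <- bd_term(obs11 = 17, adj11 = 17.143, row1 = 80, col1 = 58, n = 395)
bd_girls <- bd_term(obs11 = 8,  adj11 = 7.82,   row1 = 63, col1 = 28, n = 344)

bd_stat <- unname(bd_boys["term"] + bd_girls["term"])
crit <- qchisq(0.95, df = 1)         # chi-square critical value, df = K - 1 = 1

print(round(c(var_boys = unname(bd_boys["variance"]),
              var_girls = unname(bd_girls["variance"])), 3))
print(bd_stat < crit)                # TRUE means we do not reject homogeneity
```

Whether one or four terms per table are summed, the statistic stays far below the critical value, so the conclusion does not change.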
We know that Mantel-Haenszel common odds ratio is a useful statistic when we have homogeneous association in the
partial tables. Since we do have it here, it is quite appropriate to calculate a common odds ratio.
(d)
A 95% confidence interval for this common odds ratio is [1.102, 3.053], with a p-value less than significance level α =
0.05. So, the result is significant.
Manual calculation
        A = Yes   A = No   Total
S≥½        25      118      143
S<½        61      535      596
Total      86      653      739
Variance of the log odds ratio $= \frac{1}{25} + \frac{1}{118} + \frac{1}{61} + \frac{1}{535} = 0.066737$
Standard error $= \sqrt{\text{variance}} = \sqrt{0.066737} = 0.2583$
For 95% confidence, margin of error = 1.96 × 0.2583 = 0.5063
95% confidence interval of the log odds ratio = [0.1002, 1.1128], so the 95% confidence interval of the odds ratio = $[e^{0.1002}, e^{1.1128}] = [1.105, 3.043]$
This is very close to what we got from the StatXact 12 application (snapped above).
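The manual calculation can be retraced in R. This sketch assumes, as the interval in the text suggests, that the interval is centred on the log of the Mantel-Haenszel common odds ratio (recomputed here from the partial tables) and that the variance comes from the four cells of the combined table. Variable names are ours.

```r
# Partial tables: rows = smoking (>= 1/2 pack, < 1/2 pack), columns = asthma (yes, no)
boys  <- matrix(c(17, 63, 41, 274), nrow = 2, byrow = TRUE)
girls <- matrix(c(8, 55, 20, 261), nrow = 2, byrow = TRUE)

# Mantel-Haenszel common odds ratio
mh_or <- (boys[1, 1] * boys[2, 2] / sum(boys) + girls[1, 1] * girls[2, 2] / sum(girls)) /
         (boys[1, 2] * boys[2, 1] / sum(boys) + girls[1, 2] * girls[2, 1] / sum(girls))

# Standard error of the log odds ratio from the combined table cells
combined <- boys + girls                 # 25, 118, 61, 535
se <- sqrt(sum(1 / combined))

# 95% interval on the log scale, then back-transformed to the odds-ratio scale
log_ci <- log(mh_or) + c(-1, 1) * 1.96 * se
or_ci <- exp(log_ci)
print(round(se, 4))
print(round(or_ci, 3))
```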
(e)
Although we have homogeneous association between the odds ratios for boys and girls, that is, $\theta_{AS(G=\text{Boys})} \approx \theta_{AS(G=\text{Girls})}$, they
both are NOT equal to 1.
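Gender G   Smoking S   Asthma = Yes   Asthma = No   Total
Boys       S≥½              17             63          80
           S<½              41            274         315
           Total            58            337         395
Girls      S≥½               8             55          63
           S<½              20            261         281
           Total            28            316         344

Boys: odds ratio $= \frac{17 \times 274}{63 \times 41} = 1.8$, expected value $\mu_{11k} = \frac{80 \times 58}{395} = 11.75$, $\mathrm{Var}(n_{11k}) = \frac{80 \times 315 \times 58 \times 337}{395^2 \times (395 - 1)} = 8.01$
Girls: odds ratio $= \frac{8 \times 261}{55 \times 20} = 1.9$, expected value $\mu_{11k} = \frac{63 \times 28}{344} = 5.13$, $\mathrm{Var}(n_{11k}) = \frac{63 \times 281 \times 28 \times 316}{344^2 \times (344 - 1)} = 3.86$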
$\sum_k n_{11k} = 17 + 8 = 25$, $\sum_k \mu_{11k} = 11.75 + 5.13 = 16.88$, $\sum_k \mathrm{Var}(n_{11k}) = 8.01 + 3.86 = 11.87$
CMH statistic $= \frac{(25 - 16.88)^2}{11.87} = \frac{65.9344}{11.87} = 5.555$
Since the CMH statistic is distributed as $\chi^2$ with df = K − 1 under the null hypothesis, we compare our computed statistic with the critical value from the chi-square table.
At 95% confidence, α = 0.05, and degrees of freedom = K − 1 = 2 − 1 = 1 (K = number of partial tables)
$\chi^2_{1, 0.05} = 3.841$
CMH > $\chi^2_{1, 0.05}$, and so we reject the null hypothesis of conditional independence. In other words, asthma and maternal smoking are NOT conditionally independent given gender.
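The CMH calculation above can be retraced in R without intermediate rounding. The sketch assumes the formulas used in the text (hypergeometric mean and variance of n11k with fixed margins, no continuity correction); the helper name cmh_parts is ours, and the unrounded statistic differs slightly from the hand-rounded 5.555.

```r
# Partial tables: rows = smoking (>= 1/2 pack, < 1/2 pack), columns = asthma (yes, no)
tables <- list(
  boys  = matrix(c(17, 63, 41, 274), nrow = 2, byrow = TRUE),
  girls = matrix(c(8, 55, 20, 261), nrow = 2, byrow = TRUE)
)

# Observed n11k, its expected value and its variance under conditional independence
cmh_parts <- function(tab) {
  n <- sum(tab)
  r <- rowSums(tab)
  cc <- colSums(tab)
  out <- c(tab[1, 1], r[1] * cc[1] / n, prod(r, cc) / (n^2 * (n - 1)))
  names(out) <- c("obs", "mu", "var")
  out
}

parts <- sapply(tables, cmh_parts)       # 3 x 2 matrix: rows obs/mu/var, one column per table
cmh <- (sum(parts["obs", ]) - sum(parts["mu", ]))^2 / sum(parts["var", ])
crit <- qchisq(0.95, df = length(tables) - 1)

print(round(parts, 2))
print(round(cmh, 2))
print(cmh > crit)                        # TRUE means we reject conditional independence
```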
References