Analysis of Categorical Data
STAT-7213
BIO-STATISTICS
M.PHIL STATISTICS
YEAR-1 (SEMESTER-II)
Submitted to:
TOPIC: ANALYSIS OF CATEGORICAL DATA
CATEGORICAL DATA
Categorical data are data that classify an observation as belonging
to one or more categories. For example, an item might be judged as
good or bad, or a response to a survey might include categories
such as agree, disagree, or no opinion.
ANALYSIS OF CATEGORICAL DATA
BASIC ANALYSIS
GOODNESS-OF-FIT TEST
USES OF GOF TEST
METHODS OF GOODNESS-OF-FIT TEST
KEY POINTS
CHI-SQUARE TEST
TEST STATISTIC
DEGREES OF FREEDOM
PROCEDURE
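1. State Null and Alternative Hypothesis.
2. Level of Significance.
3. Test Statistic.
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected frequencies.
4. Computation.
5. Critical Region:
Reject $H_0$ if $\chi^2 > \chi^2_{\alpha}$, the table value at the chosen level of significance and degrees of freedom.
6. Conclusion.
If the computed $\chi^2$ falls in the critical region, reject $H_0$; otherwise, fail to reject $H_0$.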
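The six steps above can be sketched in Python as follows. This is a minimal illustration, not the worked example from the slides: the counts are hypothetical, the function name `gof_chi_square` is made up for this sketch, and the degrees of freedom assume that no parameters were estimated from the data.

```python
def gof_chi_square(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    # Step 3/4: sum of (O - E)^2 / E over all categories
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Assumption: no parameters estimated from the data, so df = k - 1
    df = len(observed) - 1
    return statistic, df


# Hypothetical data: 100 observations in 4 categories, equal frequencies under H0
observed = [30, 20, 25, 25]
expected = [25.0, 25.0, 25.0, 25.0]
statistic, df = gof_chi_square(observed, expected)

# Step 5: chi-square table value for alpha = 0.05 with df = 3
CRITICAL_VALUE = 7.815
reject_h0 = statistic > CRITICAL_VALUE
print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```

The critical value is hard-coded from a chi-square table so that the sketch needs only the standard library.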
GOVERNMENT COLLEGE UNIVERSITY
EXAMPLE
SOLUTION
• Level of Significance.
• Test Statistic.
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
PROCEDURE
• Computation.
• Critical Region:
• Conclusion.
Since the computed $\chi^2$ falls in the critical region, we reject $H_0$ and conclude that students do not enroll at random.
CONTINGENCY TABLE
TYPES OF CONTINGENCY TABLE
2×2 CONTINGENCY TABLE
TEST STATISTIC
CRITERIA FOR FISHER EXACT TEST
ASSUMPTIONS
ODDS RATIO
DIFFERENCE BETWEEN ODDS AND ODDS RATIO
Odds
DEF: The odds for success are the ratio of the probability of success to the
probability of failure.
The odds of being a case (having the disease) to being a control (not having the
disease) among subjects with the risk factor are [a/(a+b)]/[b/(a+b)] = a/b.
The odds of being a case (having the disease) to being a control (not having the
disease) among subjects without the risk factor are [c/(c+d)]/[d/(c+d)] = c/d.
Odds ratio
The odds ratio is a measure that we may compute from the data of a retrospective study.
We use the symbol OR to indicate that the measure is computed from sample data and
used as an estimate of the population odds ratio.
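As a small illustration of the two odds defined above, the sketch below computes a/b, c/d and their ratio from the four cells of a 2×2 table. The cell counts are hypothetical and the function names are chosen only for this example.

```python
def odds(cases, controls):
    """Odds of being a case versus a control within one exposure group."""
    if controls == 0:
        raise ValueError("odds are undefined when there are no controls")
    return cases / controls


def odds_ratio(a, b, c, d):
    """Sample odds ratio OR = (a/b) / (c/d) for a 2x2 table.

    a, b: cases and controls among subjects with the risk factor
    c, d: cases and controls among subjects without the risk factor
    """
    odds_without = odds(c, d)
    if odds_without == 0:
        raise ValueError("odds ratio is undefined when c is zero")
    return odds(a, b) / odds_without


# Hypothetical counts, not taken from the slides
a, b, c, d = 20, 10, 15, 30
odds_with_factor = odds(a, b)
odds_without_factor = odds(c, d)
sample_or = odds_ratio(a, b, c, d)
print(odds_with_factor, odds_without_factor, sample_or)
```

Algebraically the ratio equals ad/(bc), so the sketch gives the same value as the cross-product form.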
PROPERTIES
INTERPRETATION
A value of 1 indicates no association between the risk factor and disease status.
A value less than 1 indicates reduced odds of the disease among subjects with
the risk factor.
A value greater than 1 indicates increased odds of having the disease among
subjects in whom the risk factor is present.
EXAMPLE
The odds of a death sentence if the defendant was black = 28/45 = 0.6222.
The impact of being black on receiving the death penalty is measured by the odds ratio, that is,
the odds for black defendants divided by the odds for non-black defendants.
INTERPRET
The odds of a death sentence are 47% higher for black defendants than for
non-black defendants.
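YATES' METHOD
Cochran suggests that the chi-square test should not be used if n is small and
an expected frequency is less than 5.
Yates (1934) proposed a continuity correction for the 2×2 table, that is,
$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$
OR
$\chi^2 = \frac{n(|ad - bc| - n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$
where a, b, c, and d are the cell frequencies and n = a + b + c + d.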
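The shortcut form of the Yates-corrected statistic for a 2×2 table can be sketched as below. The counts are hypothetical, the function names are invented for this sketch, and the p-value helper relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math


def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table with cells a, b (row 1), c, d (row 2)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be positive")
    # Assumption of this sketch: the corrected difference is not allowed to go below zero
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * corrected ** 2 / denominator


def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts, not taken from the slides
statistic = yates_chi_square(10, 20, 20, 10)
p_value = chi_square_p_value_1df(statistic)
print(f"corrected chi-square = {statistic:.3f}, p-value = {p_value:.4f}")
```

The helper covers only the one-degree-of-freedom case, which is all a 2×2 table needs; larger tables would require a table lookup or a statistics library.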
CRITERIA FOR YATES' CORRECTION
MATCHED-PAIR STUDIES
EXAMPLE
Pairs with the same exposure status for both case and control (the diagonal cells)
are called concordant pairs (c1 and c2), and pairs with different exposures (the
off-diagonal cells) are called discordant pairs (d1 and d2).
Let $p$ be the probability that a discordant pair has an exposed case. Then, from the
preceding table, $p$ can be estimated by the proportion of discordant pairs in which the case is exposed.
HYPOTHESIS
Under the null hypothesis of no association between the risk factor and the
disease, each discordant pair is just as likely to have a case exposed as to have a
control exposed. Thus, the null hypothesis can be written as $H_0: p = 1/2$.
APPROXIMATION
For large samples, we can use the normal approximation.
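A minimal sketch of the large-sample test for matched pairs follows. It assumes that d1 counts the discordant pairs in which the case is exposed and d2 those in which the control is exposed (the slide's table is not available, so this labeling is an assumption), uses hypothetical counts, and applies no continuity correction.

```python
import math


def matched_pair_z_test(d1, d2):
    """Normal-approximation test of H0: p = 1/2 using only the discordant pairs.

    d1: discordant pairs with an exposed case (assumed labeling)
    d2: discordant pairs with an exposed control (assumed labeling)
    Returns (p_hat, z, two-sided p-value).
    """
    n = d1 + d2
    if n == 0:
        raise ValueError("at least one discordant pair is required")
    p_hat = d1 / n
    # Under H0 the standard error of p_hat is sqrt(0.5 * 0.5 / n)
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    # Two-sided tail area of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_hat, z, p_value


# Hypothetical counts, not taken from the slides
p_hat, z, p_value = matched_pair_z_test(25, 10)
print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, p-value = {p_value:.4f}")
```

The statistic simplifies to (d1 - d2)/sqrt(d1 + d2), so the concordant pairs play no part in the test.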
R × C CONTINGENCY TABLE
There are (r - 1)(c - 1) degrees of freedom for the r by c table because once we
know the frequencies of any (r - 1)(c - 1) cells, we can find the values of the
other frequencies by subtraction from the row and column totals. The hypothesis
of no association between the row and column variables is tested using the
chi-square goodness-of-fit statistic. Most statisticians perform no adjustment to the
test statistic when used with tables other than the 2 by 2 table. If the test statistic
is greater than the critical value $\chi^2_{\alpha}$ with (r - 1)(c - 1) degrees of freedom, we reject
the hypothesis of no association in favor of the alternative that the row and column
variables are related. If the test statistic is less than this critical value, we fail to
reject the null hypothesis.
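The test described above can be sketched for any r by c table as follows. The table is hypothetical and the function name is invented for this sketch; expected frequencies are computed in the usual way as (row total × column total) / n.

```python
def contingency_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if any(t == 0 for t in row_totals + col_totals):
        raise ValueError("every row and column total must be positive")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the hypothesis of no association
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 3 table, not taken from the slides
table = [[10, 20, 30],
         [20, 20, 20]]
statistic, df = contingency_chi_square(table)

# Chi-square table value for alpha = 0.05 with df = 2
CRITICAL_VALUE = 5.991
reject_h0 = statistic > CRITICAL_VALUE
print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```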
MULTIPLE 2×2 CONTINGENCY TABLES
EXAMPLE
EXAMINE THE RELATIONSHIP
A study was conducted to determine whether there is any association between the occurrence of upper respiratory infections (URI) in young children and outdoor
air pollution. Several variables (e.g., dust, traffic, smoke) could affect the relationship between the occurrence of infections and outdoor air pollution.
Hypothetical data for this situation, based on an article by Jaakkola et al. (1991), are shown in the table.
SOLUTION
One way of taking the passive smoke variable into account is to analyze each 2 by 2 table
separately. We then have two tables: one for homes in which someone smoked and one for homes in which no one smoked.
Table 1.
Calculations:
Using the chi-square and odds ratio formulas, the $X^2_{YC}$ statistic is 2.039
and its p-value is 0.1533 for homes in which someone smoked. The odds ratio for these data
is 1.613. The 95 percent confidence interval for the odds ratio is from 0.887 to 2.933.
Table 2.
Calculations:
The $X^2_{YC}$ value is 3.645, and its p-value is 0.0562 for those without passive smoke
in the home. The odds ratio for these data is 1.480. The 95 percent confidence interval for
the odds ratio is from 1.007 to 2.171.
Interpretation
The first confidence interval, a much wider interval than the second interval, includes the
value of one that suggests that there is no relation between the two variables. The second
interval barely misses including one. The second interval’s smaller size reflects the larger
sample size associated with the homes in which there was no passive smoke. Neither of these tables has a statistically
significant association between the outdoor air pollution and the occurrence of URI at the
0.05 level based on the test statistics. The conclusion from the analyses of the separate
tables is different from that of the combined table.
A problem with the use of the separate tables is that the analyses are based on the smaller
sample sizes associated with each sub-table, not on the sample size of the combined table.
This makes it difficult to find the presence of small but consistent trends across tables.
COCHRAN-MANTEL-HAENSZEL TEST
Two biostatisticians, Nathan Mantel and William Haenszel, developed a method in 1959
for examining the relation between two categorical variables while controlling for another
categorical variable (Mantel and Haenszel 1959).
This method, like a method published by William Cochran in 1954, uses all the data in the
combined table and produces one overall test statistic. The test is designed to detect the
consistent effect of the independent variable on the dependent variable across the levels of
the extraneous variable.
Thus, this method should only be used when the estimated odds ratios in the subtables
are similar to one another. One very attractive feature of this test is that it can be used with
extremely small sample sizes.
PROPERTIES
WHEN TO USE
CMH
We have one Z* test statistic, but because we are dealing with discrete variables, we should use
the continuity correction with Z*. However, instead of using the continuity-corrected
Z* statistic, we would prefer to use a chi-square statistic, since all the other tests
associated with contingency tables use a chi-square statistic. This poses no problem, since
the square of a standard normal variable follows a chi-square distribution with one degree
of freedom. Thus, the statistic to be used to test the hypothesis of no association between
air pollution and the occurrence of upper respiratory problems is the Cochran-Mantel-
Haenszel chi-square statistic.
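The Cochran-Mantel-Haenszel statistic, also called the Mantel-Haenszel statistic, is defined by
$X^2_{CMH} = \frac{(|O - E| - 0.5)^2}{V}$
where $O_i$ and $E_i$ are the observed and expected values in the (1,1) cell in the ith
subtable.
In terms of the entries in the ith table, $E_i$ is defined as
$E_i = \frac{(a_i + b_i)(a_i + c_i)}{n_i}$
where $a_i, b_i, c_i, d_i$ are the cell counts of the ith 2×2 table, with $a_i$ in the (1,1) cell, and $n_i$ is its total.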
VARIANCE
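The variance of the (1,1) cell count in the ith subtable is
$V_i = \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$
with $a_i, b_i, c_i, d_i$ the cell counts and $n_i$ the total of the ith subtable.
In $X^2_{CMH}$, O, E, and V are defined as the sums of the $O_i$, the $E_i$, and the $V_i$
over the k subtables. If $X^2_{CMH}$ is greater than the chi-square table value with one
degree of freedom, we reject the hypothesis of no association between air
pollution and the occurrence of upper respiratory infections. Otherwise we fail to
reject the null hypothesis.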
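The computation of the Cochran-Mantel-Haenszel statistic over k subtables can be sketched as below. The subtables are hypothetical, the function name is invented for this sketch, and the variance uses the common n²(n − 1) denominator, which is an assumption since the slide's formula was not recoverable.

```python
def cmh_chi_square(tables):
    """Continuity-corrected CMH chi-square for a list of 2x2 tables (a, b, c, d)."""
    total_o = total_e = total_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            raise ValueError("each subtable needs at least two observations")
        total_o += a                           # observed (1,1) count
        total_e += (a + b) * (a + c) / n       # expected (1,1) count
        # Assumed variance form with n^2 (n - 1) in the denominator
        total_v += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    if total_v == 0:
        raise ValueError("the summed variance is zero; the statistic is undefined")
    return (abs(total_o - total_e) - 0.5) ** 2 / total_v


# Hypothetical subtables, one per level of the extraneous variable
tables = [(20, 10, 10, 20), (15, 5, 10, 10)]
statistic = cmh_chi_square(tables)

# Chi-square table value for alpha = 0.05 with df = 1
CRITICAL_VALUE = 3.841
reject_h0 = statistic > CRITICAL_VALUE
print(f"CMH chi-square = {statistic:.3f}, reject H0: {reject_h0}")
```

Summing O, E, and V before forming the ratio is what lets small but consistent effects across subtables add up, in contrast to testing each subtable separately.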
EXAMPLE
MANTEL-HAENSZEL COMMON ODDS RATIO
Mantel and Haenszel also showed how to combine the data from the separate subtables to
form a common odds ratio for the data. Again, this should only be done when the
estimated odds ratios in the subtables are similar. If the estimated odds ratios for the
subtables are not similar (for example, some are less than one and some are greater than one),
the common odds ratio would not be very useful. The relation between the independent
and dependent variable would depend on the level of the extraneous variable, and the use
of a common odds ratio would mask this. The Mantel-Haenszel
estimator of the common odds ratio, θ, is
$\hat{\theta}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$
where $a_i, b_i, c_i, d_i$ are the cell counts and $n_i$ is the total of the ith subtable.
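A minimal sketch of the Mantel-Haenszel common odds ratio follows, using the standard estimator Σ(a_i d_i / n_i) / Σ(b_i c_i / n_i). The subtables are hypothetical and the function name is invented for this sketch.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio for a list of 2x2 tables (a, b, c, d)."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            raise ValueError("empty subtable")
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("common odds ratio is undefined when every b*c is zero")
    return numerator / denominator


# Hypothetical subtables with similar individual odds ratios (4.0 and 3.0)
tables = [(20, 10, 10, 20), (15, 5, 10, 10)]
common_or = mantel_haenszel_odds_ratio(tables)
print(f"Mantel-Haenszel common odds ratio = {common_or:.3f}")
```

The result lies between the two subtable odds ratios, which is why the estimator is only informative when those ratios are similar.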
DISADVANTAGES
RECOMMENDATION
• https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Contingency_Tables-Crosstabs-Chi-Square_Test.pdf