
Analysis of Categorical Data

This document discusses the analysis of categorical data. It defines categorical data as data that classifies observations into categories. Some common methods for analyzing categorical data discussed include goodness-of-fit tests, contingency tables, and odds ratios. The chi-square test and Fisher's exact test are presented as methods for analyzing contingency tables. Examples are provided to demonstrate how to perform chi-square tests and calculate odds ratios.


STAT-7213
BIO-STATISTICS

M.PHIL STATISTICS
YEAR-1 (SEMESTER-II)

Submitted to:

Dr. Jamal Abdul Nasir


PRESENTATION
PRESENTED BY

MUBEEN ASGHAR (0557)


LAIBA SUBHANI (0559)
NOOR-E-AMNA (0561)
RABIA SAIF (0563)

3
TOPIC

ANALYSIS OF CATEGORICAL DATA
CATEGORICAL DATA
Categorical data is data that classifies an observation as belonging
to one or more categories. For example, an item might be judged as
good or bad, or a response to a survey might include categories
such as agree, disagree, or no opinion.
5
ANALYSIS OF CATEGORICAL DATA

Categorical data analysis is the analysis of data where the
response variable has been grouped into a set of mutually exclusive
ordered (such as age group) or unordered (such as eye color)
categories.

6
BASIC ANALYSIS

• The Goodness-of-Fit Test
• The 2 by 2 Contingency Table
• The r by c Contingency Table
• Multiple 2 by 2 Contingency Tables

7
GOODNESS-OF-FIT TEST

The goodness-of-fit test is a statistical hypothesis test to see how
well sample data fit a hypothesized distribution, for example whether
a sample comes from a population with a normal distribution.

8
USES OF GOF TEST

• Goodness-of-fit tests are statistical methods often used to make
inferences about observed values.
• These tests determine how closely actual values match the predicted
values in a model, and when used in decision-making, goodness-of-fit
tests can help predict future trends and patterns.
• Goodness-of-fit tests are commonly used to test for the normality of
residuals or to determine whether two samples are gathered from
identical distributions.

9
METHODS OF GOODNESS-OF-FIT TEST

There are multiple methods for determining goodness-of-fit. Some
of the most popular methods used in statistics include:
• The chi-square test
• The Kolmogorov-Smirnov test
• The Anderson-Darling test
• The Shapiro-Wilk test

10
KEY POINTS

• Goodness-of-fit tests are statistical tests aiming to determine whether a set of
observed values matches those expected under the applicable model.
• There are multiple types of goodness-of-fit tests, but the most common is the
chi-square test.
• Chi-square determines if a relationship exists between categorical variables.
• The Kolmogorov-Smirnov test, used for large samples, determines whether a
sample comes from a specific distribution of a population.
• Goodness-of-fit tests can show you whether your sample data fit an expected set
of data from a population with normal distribution.
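As a rough illustration of the chi-square goodness-of-fit idea in the key points above, the sketch below compares observed category counts with the counts expected under a hypothesized model. The counts, the equal-proportion model, and the function name `chi_square_gof` are hypothetical and used only for illustration; 5.991 is the tabulated chi-square critical value for alpha = 0.05 and 2 degrees of freedom.

```python
def chi_square_gof(observed, expected_probs):
    """Return (statistic, df) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    # Expected count in each category under the hypothesized proportions.
    expected = [n * p for p in expected_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical data: 90 responses over three categories,
# tested against equal expected proportions.
observed = [40, 30, 20]
stat, df = chi_square_gof(observed, [1 / 3, 1 / 3, 1 / 3])

CRITICAL_0_05_DF2 = 5.991  # tabulated critical value, alpha = 0.05, df = 2
reject = stat > CRITICAL_0_05_DF2
print(f"chi-square = {stat:.3f}, df = {df}, reject H0: {reject}")
```

With these made-up counts the statistic exceeds the critical value, so the equal-proportion model would be rejected at the 0.05 level.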

11
CHI-SQUARE TEST

The chi-square independence test is a procedure for
testing whether two categorical variables are related in some
population.

12
TEST STATISTIC

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is an observed frequency and E is an estimated expected frequency,

$$E = \frac{\text{row total} \times \text{column total}}{n}$$

13
DEGREES OF FREEDOM

The degrees of freedom is basically a
number that determines the exact
shape of our distribution. The figure
illustrates this point.
Degrees of freedom (df) are calculated as
df = (r - 1)(c - 1)
where r is the number of rows and c is the number of columns.

14
PROCEDURE
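1. State Null and Alternative Hypothesis.
2. Level of Significance.
3. Test Statistic.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
4. Computation.
5. Critical Region:
Reject $H_0$ if the computed $\chi^2$ exceeds the tabulated value $\chi^2_{\alpha,\,(r-1)(c-1)}$.
6. Conclusion.
If the computed $\chi^2$ falls in the critical region, reject $H_0$; otherwise, fail to reject $H_0$.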
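The six steps above can be sketched in code for a 2 by 2 table. This is a minimal sketch under stated assumptions: the counts are hypothetical, `chi_square_2x2` is an illustrative name, and the p-value uses the identity that, for 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)). The value 3.841 is the tabulated critical value for alpha = 0.05 and df = 1.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, o in enumerate(obs_row):
            e = row_totals[i] * col_totals[j] / n  # E = row total * column total / n
            stat += (o - e) ** 2 / e
    # For df = 1 the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Steps 1-2: H0 says the two variables are independent; alpha = 0.05.
alpha = 0.05
# Hypothetical counts, for illustration only.
table = [[30, 20], [15, 35]]
# Steps 3-4: test statistic and computation.
stat, p_value = chi_square_2x2(table)
# Step 5: critical region for df = (2 - 1)(2 - 1) = 1.
critical = 3.841
# Step 6: conclusion.
reject = stat > critical
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```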
15
GOVERNMENT COLLEGE UNIVERSITY
EXAMPLE

The data concern the popularity of psychology professors, measured by the
number of students enrolled with each professor. At the 0.05
significance level, test whether students enroll at random.

16
SOLUTION

• State Null and Alternative Hypothesis.

• Level of Significance.
$\alpha = 0.05$

• Test Statistic.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
17
PROCEDURE

• Computation.

• Critical Region:

• Conclusion.
As the computed $\chi^2$ falls in the critical region, we reject $H_0$ and conclude that students do not enroll at random.
18
CONTINGENCY TABLE

A contingency table (also known as a cross tabulation or crosstab)
is a type of table in a matrix format that displays the
(multivariate) frequency distribution of the variables.
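To make the definition concrete, a contingency table can be built by counting how often each combination of categories occurs. The sketch below uses only the Python standard library; the observations and the helper name `crosstab` are hypothetical and serve purely as an illustration.

```python
from collections import Counter

# Hypothetical raw observations: (group, outcome) for each subject.
observations = [
    ("intervention", "positive"), ("intervention", "positive"),
    ("intervention", "negative"), ("no intervention", "positive"),
    ("no intervention", "negative"), ("no intervention", "negative"),
]


def crosstab(pairs):
    """Count each (row category, column category) pair into a matrix."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


rows, cols, table = crosstab(observations)
for label, counts_row in zip(rows, table):
    print(label, dict(zip(cols, counts_row)))
```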

19
TYPES OF CONTINGENCY TABLE

• 2×2 Contingency table
• r × c Contingency table
• Multiple 2×2 Contingency table

20
2×2 CONTINGENCY TABLE

The two by two or fourfold contingency
table represents two classifications of a set of counts or frequencies.
The rows represent two classifications of one variable (e.g.,
outcome positive/outcome negative) and the columns
represent two classifications of another variable (e.g.,
intervention/no intervention).

21
TEST STATISTIC

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where, for r rows and c columns of n observations, O is an observed frequency and E
is an estimated expected frequency,

$$E = \frac{\text{row total} \times \text{column total}}{n}$$
22
FISHER EXACT TEST

Fisher's exact test is a statistical significance test used in the
analysis of contingency tables.
Although it is valid for all sample sizes, in practice it is employed when sample sizes are small.

23
CRITERIA FOR FISHER EXACT TEST

• Both variables are dichotomous qualitative (2 cross 2 table).
• When the overall total of the table (sample size) is 30.
• When any one expected cell value is less than 5.

24
ASSUMPTIONS

• The data consist of A sample observations from population 1 and
B sample observations from population 2.
• The samples are random and independent.
• Each observation can be categorized as one of two mutually
exclusive types.
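A minimal sketch of how Fisher's exact test can be computed for a 2×2 table: with the row and column totals held fixed, the count in the first cell follows a hypergeometric distribution, and the two-sided p-value here sums the probabilities of all tables that are no more likely than the observed one. The counts and the function name `fisher_exact_two_sided` are hypothetical, and other two-sided conventions exist.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the (1,1) cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Sum over tables at least as extreme (no more probable) than observed;
    # the small tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )


# Hypothetical small-sample table, for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided p-value = {p_value:.4f}")
```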

25
26
27
ODDS RATIO

28
Difference Between ODDS AND ODDS RATIO

ODDS
DEF: The odds for success are the ratio of the probability of success to the
probability of failure.
The odds of being a case (having disease) to being a control (not having
disease) among subjects with risk factor is [a/(a+b)]/[b/(a+b)] = a/b.
The odds of being a case (having disease) to being a control (not having
disease) among subjects without risk factor is [c/(c+d)]/[d/(c+d)] = c/d.

ODDS RATIO
The odds ratio is a measure that we may compute from the data of a retrospective study.
It is the ratio of the two odds, OR = (a/b)/(c/d) = ad/bc.
We use the symbol OR to indicate that the measure is computed from sample data and
used as an estimate of the population odds ratio.

29
PROPERTIES

• The odds ratio can equal any non-negative number.
• When OR > 1, the odds of success are higher in row 1 than in row 2.
• When one cell has zero probability, OR equals 0 or ∞.

30
INTERPRETATION

A value of 1 indicates no association between the risk factor and disease status.
A value less than 1 indicates reduced odds of the disease among subjects with
the risk factor.
A value greater than 1 indicates increased odds of having the disease among
subjects in whom the risk factor is present.

31
EXAMPLE

Compute the odds of receiving a death penalty for each group.

32
The odds of a death sentence if the defendant was black = 28/45 = 0.6222

The odds of a death sentence if the defendant was non-black = 22/52 = 0.4231

The impact of being black on receiving the death penalty is measured by the odds ratio:

$$OR = \frac{28/45}{22/52} = \frac{0.6222}{0.4231} = 1.47$$

INTERPRET
The odds of a death sentence are 47% higher for black defendants than for
non-black defendants.
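The arithmetic in this example can be checked with a few lines of code. The counts 28, 45, 22 and 52 are taken from the slide; the helper name `odds` and the variable names are illustrative only.

```python
def odds(events, non_events):
    """Odds = number with the outcome / number without the outcome."""
    return events / non_events


odds_black = odds(28, 45)        # death sentence vs. no death sentence
odds_non_black = odds(22, 52)
odds_ratio = odds_black / odds_non_black
percent_higher = (odds_ratio - 1) * 100

print(f"odds (black)     = {odds_black:.4f}")
print(f"odds (non-black) = {odds_non_black:.4f}")
print(f"odds ratio       = {odds_ratio:.2f}")
print(f"odds are {percent_higher:.0f}% higher")
```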

33
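YATES' METHOD

Cochran suggests that the chi-square test should not be used if n is small and
an expected frequency is less than 5.
Yates (1934) proposed a continuity correction for the case of a 2×2 table, that is,

$$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$$

OR

$$\chi^2 = \frac{n\,(|ad - bc| - n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$$

where a, b, c and d are the four cell counts and n = a + b + c + d.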
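The two standard ways of writing Yates' corrected statistic for a 2×2 table, one as a sum over cells of (|O - E| - 0.5)²/E and one as a shortcut in the cell counts a, b, c, d, give the same value. The sketch below checks this on a hypothetical table; the function names and the counts are assumptions made for illustration.

```python
def yates_sum_form(a, b, c, d):
    """Sum over the four cells of (|O - E| - 0.5)^2 / E."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n
            stat += (abs(observed[i][j] - e) - 0.5) ** 2 / e
    return stat


def yates_shortcut_form(a, b, c, d):
    """n (|ad - bc| - n/2)^2 / [(a+b)(c+d)(a+c)(b+d)]."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )


# Hypothetical small table, for illustration only.
cells = (8, 2, 3, 7)
sum_form = yates_sum_form(*cells)
shortcut_form = yates_shortcut_form(*cells)
print(f"sum form = {sum_form:.4f}, shortcut form = {shortcut_form:.4f}")
```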

34
CRITERIA FOR YATES' CORRECTION

• Both variables are dichotomous qualitative (2 cross 2 table).
• When the overall total of the table (sample size) is 30.
• When any one expected cell value is less than 5.

35
36
37
38
MATCHED-PAIR STUDIES

A matched pairs design is an experimental design that is used when
an experiment only has two treatment conditions. The subjects in
the experiment are grouped together into pairs based on some
variable they “match” on, such as age or gender. Then, within each
pair, subjects are randomly assigned to different treatments.

39
EXAMPLE

Pairs with the same exposure status for both case and control (the diagonal cells)
are called concordant pairs (c1 and c2), and pairs with different exposures (the
off-diagonal cells) are called discordant pairs (d1 and d2).
40
EXAMPLE

Let $\pi$ be the probability that a discordant pair has an exposed case. Then, from the
preceding table, $\pi$ can be estimated by the following proportion, where d1 is the
number of discordant pairs with an exposed case,

$$\hat{\pi} = \frac{d_1}{d_1 + d_2}$$

41
HYPOTHESIS

Under the null hypothesis of no association between the risk factor and the
disease, each discordant pair is just as likely to have a case exposed as to have a
control exposed. Thus, the null hypothesis can be written as

$$H_0: \pi = \frac{1}{2}$$

42
APPROXIMATION

For large samples, we can use the normal approximation.
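One common large-sample form of this test is the continuity-corrected McNemar statistic, z = (|d1 - d2| - 1)/sqrt(d1 + d2), which compares the two discordant counts. The sketch below assumes that form, which may differ in detail from the formula on the original slide; the discordant counts and the function name `mcnemar_z` are hypothetical.

```python
import math


def mcnemar_z(d1, d2):
    """Continuity-corrected normal approximation for matched pairs.

    d1, d2 are the two discordant-pair counts. Returns (z, two-sided p).
    """
    z = max(abs(d1 - d2) - 1, 0) / math.sqrt(d1 + d2)
    # Two-sided tail probability of a standard normal: erfc(z / sqrt(2)).
    p_value = math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical discordant counts, for illustration only.
d1, d2 = 25, 10
pi_hat = d1 / (d1 + d2)          # estimated probability of an exposed case
z, p_value = mcnemar_z(d1, d2)
print(f"pi_hat = {pi_hat:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```

Squaring z gives the usual one-degree-of-freedom chi-square version of the same test.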

43
44
45
R × C CONTINGENCY TABLE

We now consider the more general situation where two
classification variables have more than two categories. First, we
consider the situation where both variables are nominal, followed by
the situation when one of the variables is ordinal.

46
R × C CONTINGENCY TABLE

Testing Hypothesis of No Association

The same ideas used in the 2 by 2 table still apply to the r by c
contingency table. If there is no association between a row variable
and a column variable, the ratio of the expected cell frequency in
the ith row and jth column, $m_{ij}$, to the ith row total, $n_{i\cdot}$, should
equal the ratio of the jth column total, $n_{\cdot j}$, to the overall total, n.
That is, $m_{ij} / n_{i\cdot} = n_{\cdot j} / n$, so $m_{ij} = n_{i\cdot}\, n_{\cdot j} / n$.

47
R × C CONTINGENCY TABLE

There are (r - 1)(c - 1) degrees of freedom for the r by c table because once we
know the frequencies of any (r - 1)(c - 1) cells, we can find the values of the
other frequencies by subtraction from the row and column totals. The hypothesis
of no association between the row and column variables is tested using the
chi-square goodness-of-fit statistic. Most statisticians perform no adjustment to the
test statistic when used with tables other than the 2 by 2 table. If the test statistic
is greater than the critical value of the chi-square distribution with (r - 1)(c - 1)
degrees of freedom, we reject the hypothesis of no association in favor
of the alternative that the row and column variables are related. If the test statistic
is less than the critical value, we fail to reject the null hypothesis.
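A short sketch of the r by c calculation: each expected frequency is the row total times the column total divided by the overall total, and the statistic has (r - 1)(c - 1) degrees of freedom. The 2 by 3 table and the function names are hypothetical; 5.991 is the tabulated chi-square critical value for alpha = 0.05 and df = 2.

```python
def expected_frequencies(table):
    """Expected count for cell (i, j): row total i * column total j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_rxc(table):
    """Return (statistic, df) for the test of no association."""
    expected = expected_frequencies(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical 2 by 3 table, for illustration only.
table = [[20, 30, 50], [30, 30, 40]]
expected = expected_frequencies(table)
stat, df = chi_square_rxc(table)
reject = stat > 5.991  # critical value for alpha = 0.05, df = 2
print(f"chi-square = {stat:.3f}, df = {df}, reject H0: {reject}")
```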

48
49
50
MULTIPLE 2×2 CONTINGENCY TABLE

Here, we focus on the relationship between 2 factors in the
presence of a third factor. Previously, we examined the relationship
between 2 categorical variables (factors).

51
EXAMPLE

For example, we might be interested in the relationship between
smoking and lung cancer, and how this relationship may change
with gender (a third factor). We observe that the apparent
(combined) relationship between 2 factors may change
its direction and magnitude depending on the third factor.

52
EXAMINE THE RELATIONSHIP

We will test for such a dependency, and, if we don’t
seem to find one, we will analyze the aggregated data; if we do find
such a dependency, then it is appropriate to examine the relationship
of the 2 factors of interest separately for each of the levels of the
third factor (don’t aggregate).
We will focus on 2 factors each with 2 levels, including a third
factor with possibly several (g) levels; thus, we will be working
with multiple 2×2 contingency tables.

53
A study was conducted to determine whether there is any association between the occurrence of upper respiratory infections (URI) in young children and outdoor
air pollution. Several variables (e.g., dust, traffic, smoke) could affect the relationship between the occurrence of infections and outdoor air pollution.
Hypothetical data for this situation, based on an article by Jaakkola et al. (1991), are shown in the table.

54
EXAMPLE
55
EXAMPLE
56
EXAMPLE
57
SOLUTION

One way of taking the passive smoke variable into account is to analyze each 2 by 2 table
separately. Then we have two tables, one for homes in which someone smoked and one for
homes in which no one smoked.

Table 1.

Passive smoke    City         URI     URI     Total
in the home      pollution    some    none
yes              high         100     20      120
yes              low          124     40      164
total                         224     60      284
58
SOLUTION

Calculations:
By using the chi-square and odds ratio formulas, the Yates-corrected chi-square statistic
($X^2_{YC}$) is 2.039 and its p-value is 0.1533 for homes in which someone smoked. The odds
ratio for these data is 1.613. The 95 percent confidence interval for the odds ratio is from 0.887 to 2.933.

Table 2.

Passive smoke    City         URI     URI     Total
in the home      pollution    some    none
no               high         128     62      190
no               low          166     119     285
total                         294     181     475

59
60
SOLUTION
Calculations
The Yates-corrected chi-square ($X^2_{YC}$) value is 3.645, and its p-value is 0.0562 for those without passive smoke
in the home. The odds ratio for these data is 1.480. The 95 percent confidence interval for
the odds ratio is from 1.007 to 2.171.
Interpretation
The first confidence interval, a much wider interval than the second, includes the
value of one, which suggests that there is no relation between the two variables. The second
interval barely misses including one. The second interval’s smaller size reflects the larger
sample size associated with the homes in which there was no passive smoke. Neither of these
tables has a statistically significant association between the outdoor air pollution and the
occurrence of URI at the 0.05 level based on the test statistics. The conclusion from the
analyses of the separate tables is different from that of the combined table.
A problem with the use of the separate tables is that the analyses are based on the smaller
sample sizes associated with each sub-table, not on the sample size of the combined table.
This makes it difficult to find the presence of small but consistent trends across tables.
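The figures quoted for the two sub-tables can be reproduced approximately with the sketch below, which applies the Yates-corrected chi-square to each table and computes the odds ratio with a 95 percent interval. The interval uses the log-odds (Woolf) method with z = 1.96, which is an assumption: the slides do not say which method was used, and the second table's limits differ from the quoted ones in the third decimal. The function name `analyze_2x2` is illustrative.

```python
import math


def analyze_2x2(a, b, c, d):
    """Yates-corrected chi-square, p-value, odds ratio and 95% CI."""
    n = a + b + c + d
    chi2_yc = n * (abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # Upper-tail chi-square probability for df = 1.
    p_value = math.erfc(math.sqrt(chi2_yc / 2))
    odds_ratio = (a * d) / (b * c)
    # Assumed method: Woolf (log-odds) standard error.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return {"chi2_yc": chi2_yc, "p": p_value, "or": odds_ratio,
            "ci": (lower, upper)}


smoke = analyze_2x2(100, 20, 124, 40)       # Table 1: passive smoke in home
no_smoke = analyze_2x2(128, 62, 166, 119)   # Table 2: no passive smoke

for name, res in (("Table 1", smoke), ("Table 2", no_smoke)):
    print(f"{name}: chi2 = {res['chi2_yc']:.3f}, p = {res['p']:.4f}, "
          f"OR = {res['or']:.3f}, "
          f"95% CI = ({res['ci'][0]:.3f}, {res['ci'][1]:.3f})")
```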
61
COCHRAN-MANTEL-HAENSZEL TEST

Two biostatisticians, Nathan Mantel and William Haenszel, developed a method in 1959
for examining the relation between two categorical variables while controlling for another
categorical variable (Mantel and Haenszel 1959).
This method, like a method published by William Cochran in 1954, uses all the data in the
combined table and produces one overall test statistic. The test is designed to detect the
consistent effect of the independent variable on the dependent variable across the levels of
the extraneous variable.
Thus, this method should only be used when the estimated odds ratios in the sub-tables
are similar to one another. One very attractive feature of this test is that it can be used with
extremely small sample sizes.

62
PROPERTIES

• For large samples, when H0 is true, CMH has a chi-squared distribution with df = 1.
• If all θ(AB(k)) = 1, then CMH is close to zero.
• If some or all θ(AB(k)) > 1, then CMH is large.
• If some or all θ(AB(k)) < 1, then CMH is large.
• If some θ(AB(k)) < 1 and others θ(AB(k)) > 1, then CMH is NOT an appropriate test;
that is, the test works well if the conditional odds ratios are in the same direction and
comparable in size.
This test has also been generalized for application to three-way tables of size other than 2
by 2 by k (Landis, Heyman, and Koch 1978).

63
WHEN TO USE

Use the Cochran-Mantel-Haenszel test (which is sometimes called the
Mantel-Haenszel test) for repeated tests of independence. The most common situation is
that you have multiple 2×2 tables of independence; we're analyzing the kind of
experiment that we had to analyze with a test of independence, and we have done
the experiment multiple times or at multiple locations. There are three nominal
variables: the two variables of the 2×2 test of independence, and the third
nominal variable that identifies the repeats (such as different times, different
locations, or different studies).

64
CMH

We have one Z* test statistic but, because we are dealing with discrete variables, we should use
the continuity correction with Z*. However, instead of using the continuity-corrected
Z* statistic, we would prefer to use a chi-square statistic, since all the other tests
associated with contingency tables use a chi-square statistic. This poses no problem, since
the square of a standard normal variable follows a chi-square distribution with one degree
of freedom. Thus, the statistic to be used to test the hypothesis of no association between
air pollution and the occurrence of upper respiratory problems is the
Cochran-Mantel-Haenszel chi-square statistic.
65
CMH
Also called the Mantel-Haenszel statistic, it is defined by

$$X^2_{CMH} = \frac{(|O - E| - 0.5)^2}{V}$$

where $O_i$ and $E_i$ are the observed and expected values in the (1,1) cell in the ith
sub-table.
In terms of the entries in the ith table, with cell counts $a_i$, $b_i$, $c_i$, $d_i$ and total $n_i$, $E_i$ is defined as,

$$E_i = \frac{(a_i + b_i)(a_i + c_i)}{n_i}$$

66
VARIANCE

$V_i$, the variance of $O_i$ minus $E_i$, can be written as,

$$V_i = \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2\,(n_i - 1)}$$

In $X^2_{CMH}$, O, E, and V are defined as the sums of the $O_i$, the $E_i$ and the $V_i$
over the k sub-tables. If $X^2_{CMH}$ is greater than the chi-square table value, we
reject the hypothesis of no association between air
pollution and the occurrence of upper respiratory infections. Otherwise we fail to
reject the null hypothesis.
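As a hedged illustration, the sketch below computes a Cochran-Mantel-Haenszel statistic for the two passive-smoke sub-tables given earlier (Tables 1 and 2). It assumes the standard continuity-corrected form (|O - E| - 0.5)²/V with the hypergeometric variance for each sub-table; the function name `cmh_statistic` is illustrative, and 3.841 is the tabulated chi-square critical value for alpha = 0.05 and df = 1.

```python
def cmh_statistic(tables):
    """Continuity-corrected CMH chi-square over 2x2 sub-tables (a, b, c, d)."""
    total_o = total_e = total_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        total_o += a                                  # observed (1,1) cell
        total_e += (a + b) * (a + c) / n              # expected (1,1) cell
        total_v += ((a + b) * (c + d) * (a + c) * (b + d)
                    / (n ** 2 * (n - 1)))             # variance of O_i - E_i
    return (abs(total_o - total_e) - 0.5) ** 2 / total_v


# Cell counts from Table 1 (passive smoke) and Table 2 (no passive smoke).
tables = [(100, 20, 124, 40), (128, 62, 166, 119)]
x2_cmh = cmh_statistic(tables)
reject = x2_cmh > 3.841  # critical value for alpha = 0.05, df = 1
print(f"CMH chi-square = {x2_cmh:.3f}, reject H0: {reject}")
```

Under these assumptions the pooled statistic exceeds the critical value even though neither sub-table was significant on its own, which is the kind of small but consistent trend the separate analyses can miss.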

67
EXAMPLE
68
EXAMPLE
69
EXAMPLE
70
MANTEL-HAENSZEL COMMON ODDS RATIO

Mantel and Haenszel also showed how to combine the data from the separate sub-tables to
form a common odds ratio for the data. Again, this should only be done when the
estimated odds ratios in the sub-tables are similar. If the estimated odds ratios for the
sub-tables are not similar (for example, some are less than one and some are greater than
one), the common odds ratio would not be very useful. The relation between the independent
and dependent variable would depend on the level of the extraneous variable, and the use
of a common odds ratio would mask this. The Mantel-Haenszel
estimator of the common odds ratio, θ, is,

$$\hat{\theta}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$$
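A small sketch of the Mantel-Haenszel common odds ratio, computed as the sum of a·d/n over sub-tables divided by the sum of b·c/n, applied to the two passive-smoke sub-tables given earlier. The function name `mh_common_odds_ratio` is illustrative, and the stratum-specific odds ratios are printed alongside so the similarity condition can be eyeballed.

```python
def mh_common_odds_ratio(tables):
    """Mantel-Haenszel estimate: sum(a*d/n) / sum(b*c/n) over sub-tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Cell counts (a, b, c, d) from Table 1 and Table 2.
tables = [(100, 20, 124, 40), (128, 62, 166, 119)]
stratum_ors = [(a * d) / (b * c) for a, b, c, d in tables]
common_or = mh_common_odds_ratio(tables)

print("stratum odds ratios:", [round(x, 3) for x in stratum_ors])
print(f"Mantel-Haenszel common odds ratio = {common_or:.3f}")
```

The common estimate falls between the two stratum-specific odds ratios, as expected for a weighted combination.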

71
DISADVANTAGES

• There is a limit to the kind of statistical analysis that can be performed on
categorical data.
• The options in categorical data do not have a standardized interval scale.
Therefore, respondents are not able to effectively gauge their options before
responding.
• Quantitative analysis cannot be performed on categorical data. Therefore,
numerical or arithmetic operations cannot be performed.

72

73
RECOMMENDATION

• https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Contingency_Tables-Crosstabs-Chi-Square_Test.pdf

74
THE END

75
