The Logic of Chi-Square Test
April 2011
The Chi-Square test is performed to assess whether two categorical variables are related to
each other.
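Assume we have a population whose members can be classified according to certain
properties such as gender, income level, etc. Let us start with the following piece of
information: a population of 1000 employees, 700 male and 300 female, of whom 10% are
CEOs, 30% are middle managers and 60% are supervisors.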
If job title and gender were totally independent, we would expect job titles to be
distributed in the same proportions within each gender. For example, out of 700 male
employees we would expect 70 to be CEOs, since 10% of the total population consists of CEOs.
Similarly, we would expect 210 males to be middle managers and 420 to be supervisors if
gender had no effect on position. For females, we would expect to see 30 CEOs, 90 middle
managers and 180 supervisors.
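The expected counts above come from multiplying each gender total by each job title's
share of the whole population. A minimal Python sketch of that arithmetic follows. It
assumes 300 females (implied by 30 + 90 + 180) and shares of 30% and 60% for middle managers
and supervisors (inferred from 210/700 and 420/700); the variable names are illustrative.

```python
from fractions import Fraction

# Gender totals: 700 males is stated in the text, 300 females is implied by 30 + 90 + 180.
gender_totals = {"male": 700, "female": 300}

# Job-title shares of the whole population. 10% for CEOs is stated in the text;
# the other two shares are inferred from the expected counts (210/700 and 420/700).
title_shares = {
    "CEO": Fraction(10, 100),
    "middle manager": Fraction(30, 100),
    "supervisor": Fraction(60, 100),
}

# Under independence, every gender has the same job-title distribution,
# so each expected count is (share of title) x (gender total).
expected = {
    (title, gender): int(share * total)
    for title, share in title_shares.items()
    for gender, total in gender_totals.items()
}

for (title, gender), count in expected.items():
    print(f"{title:15s} {gender:7s} {count}")
```

Exact fractions are used instead of floats so that the counts come out as whole numbers
without rounding.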
This information can be presented by constructing a contingency table.
∗ Bogazici University, Department of Management, Bebek, Istanbul, 34342. Email:
[email protected]. Phone: +90 212 3597508.
                  Male   Female   Total
CEO                 70       30     100
Middle manager     210       90     300
Supervisor         420      180     600
Total              700      300    1000
This is what we should observe if gender and job title were totally independent. The
further the values we observe depart from the values we expect, the larger our test
statistic should be.
The easiest way to measure the distance between the observed and expected values is
to take their differences. However, some differences would have a negative sign, which is
awkward to interpret and lets positive and negative differences cancel out when summed.
To get rid of the negative sign, we can take the squares of the differences. Squaring also
penalizes larger differences more heavily. As the final step, we sum these squared
differences to get an idea of the total difference between the observed and expected values.
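The Chi-Square test statistic is calculated according to exactly this principle, with one
refinement: each squared difference is divided by the expected value of its cell, so that a
deviation is judged relative to the size of the count it is compared against. To see how
the value of the statistic changes as we move away from the independence condition,
consider the following two cases.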
In this case the numbers in black are observed values whereas those in red are expected
values we calculated previously.
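\[
\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_{obs} - f_{exp})^2}{f_{exp}}
\]

\[
\chi^2_{STAT} = \frac{(80-70)^2}{70} + \frac{(20-30)^2}{30} + \frac{(220-210)^2}{210} + \frac{(80-90)^2}{90} + \frac{(400-420)^2}{420} + \frac{(200-180)^2}{180} = 9.5238
\]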
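The calculation for this first case can be reproduced with a few lines of Python. This is
only a sketch of the arithmetic: the function name `chi_square_stat` and the cell ordering
are illustrative choices, not part of the original text.

```python
def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Cell order: male CEO, female CEO, male middle manager, female middle manager,
# male supervisor, female supervisor.
expected = [70, 30, 210, 90, 420, 180]
observed_case_1 = [80, 20, 220, 80, 400, 200]

stat_1 = chi_square_stat(observed_case_1, expected)
print(round(stat_1, 4))  # 9.5238
```

When the observed counts equal the expected counts, every term is zero and the statistic
is 0, which corresponds to the perfectly independent table.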
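In the second case, every deviation from the expected value is twice as large:

\[
\chi^2_{STAT} = \frac{(90-70)^2}{70} + \frac{(10-30)^2}{30} + \frac{(230-210)^2}{210} + \frac{(70-90)^2}{90} + \frac{(380-420)^2}{420} + \frac{(220-180)^2}{180} = 38.0952
\]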
Doubling the deviations quadrupled the Chi-Square test statistic (4 × 9.5238 = 38.0952).
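A short derivation suggests why the factor is four. Write d = f_{obs} − f_{exp} for the
deviation in a cell (a symbol introduced here only for illustration) and scale every
deviation by a constant k while keeping the expected values fixed:

\[
\sum_{\text{all cells}} \frac{(k\,d)^2}{f_{exp}} = k^2 \sum_{\text{all cells}} \frac{d^2}{f_{exp}}
\]

With k = 2 the statistic is multiplied by 2^2 = 4, which agrees with the two values
computed above.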
2 Conclusion
In conclusion, as the observed values move away from what independence predicts, the
chi-square test statistic increases with the square of the deviations. This means that
rejection of the null hypothesis stating “the two variables are independent” becomes
more likely.
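To illustrate how rejection becomes more likely, the two statistics computed earlier can
be converted to p-values. The sketch below rests on assumptions not stated in the text: it
uses the usual degrees of freedom for a contingency table, (rows − 1) × (columns − 1) =
(3 − 1) × (2 − 1) = 2, and the fact that for exactly 2 degrees of freedom the chi-square
upper-tail probability has the closed form exp(−x/2). The 0.05 significance level and the
function name `p_value_df2` are illustrative choices.

```python
import math


def p_value_df2(stat):
    """Upper-tail probability of a chi-square distribution with 2 degrees of freedom.

    Valid only for df = 2, where the survival function is exp(-x / 2).
    """
    return math.exp(-stat / 2)


alpha = 0.05  # illustrative significance level (an assumption, not from the text)

for label, stat in [("case 1", 9.5238), ("case 2", 38.0952)]:
    p = p_value_df2(stat)
    print(f"{label}: statistic = {stat}, p-value = {p:.2e}, reject = {p < alpha}")
```

Under these assumptions both cases fall below the 0.05 level, and the p-value for the second
case is smaller by several orders of magnitude. For tables with other dimensions the closed
form does not apply and a statistical library would be needed.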