Chi-Square Test of Independence
This test utilizes a contingency table to analyze the data. A contingency table (also known as
a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified
according to two categorical variables. The categories for one variable appear in the rows, and
the categories for the other variable appear in columns. Each variable must have two or more
categories. Each cell reflects the total count of cases for a specific pair of categories.
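As a rough illustration of how a contingency table is assembled from case-level data, the sketch below counts category pairs using only the Python standard library. The records and the helper name `crosstab` are hypothetical and are not part of the sample dataset.

```python
from collections import Counter

# Hypothetical raw data: one (row_category, column_category) pair per case.
cases = [
    ("Nonsmoker", "Male"), ("Nonsmoker", "Female"), ("Past smoker", "Male"),
    ("Current smoker", "Female"), ("Nonsmoker", "Female"), ("Past smoker", "Female"),
]

def crosstab(pairs):
    """Return (row_labels, col_labels, table), where table[i][j] is a cell count."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for category pairs that never occur together.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

row_labels, col_labels, table = crosstab(cases)
print("columns:", col_labels)
for label, row in zip(row_labels, table):
    print(label, row)
```

Each inner list is one row of the table, and the cell counts sum to the number of cases.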
Several other tests also go by the name "chi-square test" in addition to the Chi-Square
Test of Independence. Look for context clues in the data and research question to determine
which form of the chi-square test is being used.
Common Uses
The Chi-Square Test of Independence can only compare categorical variables. It cannot make
comparisons between continuous variables or between categorical and continuous variables.
Additionally, the Chi-Square Test of Independence only assesses associations between
categorical variables, and cannot provide any inferences about causation.
If your categorical variables represent "pre-test" and "post-test" observations, then the chi-
square test of independence is not appropriate. This is because the assumption of the
independence of observations is violated. In this situation, McNemar's Test is appropriate.
Data Requirements
Hypotheses
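The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of
Independence can be expressed in two different but equivalent ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
OR
H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]"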
Test Statistic
The test statistic for the Chi-Square Test of Independence is denoted $\chi^2$, and is computed as:
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where
$o_{ij}$ is the observed cell count in the i-th row and j-th column of the table
$e_{ij}$ is the expected cell count in the i-th row and j-th column of the table, computed as
$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
R and C are the number of rows and the number of columns in the table.
The calculated $\chi^2$ value is then compared to the critical value from the $\chi^2$ distribution table
with degrees of freedom df = (R - 1)(C - 1) and chosen confidence level. If the
calculated $\chi^2$ value > critical $\chi^2$ value, then we reject the null hypothesis.
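The formula can be sketched in a few lines of plain Python. This is a minimal illustration of the computation described above, not how SPSS implements it. The function name `chi_square_independence` and the 2x2 counts are made up for the example.

```python
def chi_square_independence(observed):
    """Return (statistic, df) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count: row total * column total / grand total.
            e_ij = row_totals[i] * col_totals[j] / grand_total
            statistic += (o_ij - e_ij) ** 2 / e_ij

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Made-up 2x2 counts, for illustration only.
statistic, df = chi_square_independence([[10, 20], [20, 10]])
print(round(statistic, 3), df)
```

For this made-up table every expected count is 15, so the statistic is 4 × 25/15 ≈ 6.667 with df = 1. The tabled critical value for df = 1 at the 0.05 level is 3.841, so the null hypothesis would be rejected for these counts.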
Data Set-Up
There are two different ways in which your data may be set up initially. The format of the
data will determine how to proceed with running the Chi-Square Test of Independence. At
minimum, your data should include two categorical variables (represented in columns) that
will be used in the analysis. The categorical variables must include at least two groups. Your
data may be formatted in either of the following ways:
Cases represent subjects, and each subject appears once in the dataset. That is, each
row represents an observation from a unique subject.
The dataset contains at least two nominal categorical variables (string or numeric).
The categorical variables used in the test must have two or more categories.
An example of using the chi-square test for this type of data can be found in the Weighting
Cases tutorial.
In SPSS, the Chi-Square Test of Independence is an option within the Crosstabs procedure.
Recall that the Crosstabs procedure creates a contingency table or two-way table, which
summarizes the distribution of two categorical variables.
A Row(s): One or more variables to use in the rows of the crosstab(s). You must enter at
least one Row variable.
B Column(s): One or more variables to use in the columns of the crosstab(s). You must
enter at least one Column variable.
Also note that if you specify one row variable and two or more column variables, SPSS will
print crosstabs for each pairing of the row variable with the column variables. The same is
true if you have one column variable and two or more row variables, or if you have multiple
row and column variables. A chi-square test will be produced for each table. Additionally, if
you include a layer variable, chi-square tests will be run for each pair of row and column
variables within each level of the layer variable.
Not sure which variable should be the "row" and which should be the "column"? You can
"exchange" the row and column variables without affecting the results of the chi-square test
of independence - the test statistic and p-value will be identical.
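One way to convince yourself of this is to compute the statistic for a table and for its transpose. The sketch below does so in plain Python. The helper `chi_square` and the 3x2 counts are hypothetical and are not taken from the sample data.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a list-of-rows table of counts."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    total = 0.0
    for i, r in enumerate(observed):
        for j, o in enumerate(r):
            e = rows[i] * cols[j] / n  # expected count for cell (i, j)
            total += (o - e) ** 2 / e
    return total

# Hypothetical 3x2 table of counts (illustration only).
table = [[30, 25], [12, 18], [8, 7]]
swapped = [list(col) for col in zip(*table)]  # rows become columns

print(chi_square(table), chi_square(swapped))
```

The two printed values agree up to floating-point rounding, and the degrees of freedom are also unchanged because (R - 1)(C - 1) is symmetric in R and C.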
D Cells: Opens the Crosstabs: Cell Display window, which controls which output is
displayed in each cell of the crosstab. (Note: in a crosstab, the cells are the inner sections of
the table. They show the number of observations for a given combination of the row and
column categories.) There are three options in this window that are useful (but optional)
when performing a Chi-Square Test of Independence:
1 Observed: The actual number of observations for a given cell. This option is enabled by
default.
2 Expected: The expected number of observations for that cell (see the test statistic
formula).
3 Unstandardized Residuals: The "residual" value, computed as observed minus
expected.
PROBLEM STATEMENT
In the sample dataset, respondents were asked their gender and whether or not they were a
cigarette smoker. There were three answer choices: Nonsmoker, Past smoker, and Current
smoker. Suppose we want to test for an association between smoking behavior (nonsmoker,
current smoker, or past smoker) and gender (male or female) using a Chi-Square Test of
Independence (we'll use α = 0.05).
Before we test for "association", it is helpful to understand what an "association" and a "lack
of association" between two categorical variables looks like. One way to visualize this is using
clustered bar charts. Let's look at the clustered bar chart produced by the Crosstabs
procedure.
This is the chart that is produced if you use Smoking as the row variable and Gender as the
column variable (running the syntax later in this example):
The "clusters" in a clustered bar chart are determined by the row variable (in this case, the
smoking categories). The color of the bars is determined by the column variable (in this case,
gender). The height of each bar represents the total number of observations in that particular
combination of categories.
This type of chart emphasizes the differences within the categories of the row variable. Notice
how within each smoking category, the heights of the bars (i.e., the number of males and
females) are very similar. That is, there are an approximately equal number of male and
female nonsmokers; approximately equal number of male and female past smokers;
approximately equal number of male and female current smokers. If there were an
association between gender and smoking, we would expect these counts to differ between
groups in some way.
1. Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).
2. Select Smoking as the row variable, and Gender as the column variable.
3. Click Statistics. Check Chi-square, then click Continue.
4. (Optional) Check the box for Display clustered bar charts.
5. Click OK.
SYNTAX
CROSSTABS
/TABLES=Smoking BY Gender
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT
/COUNT ROUND CELL
/BARCHART.
OUTPUT
TABLES
The first table is the Case Processing summary, which tells us the number of valid cases used
for analysis. Only cases with nonmissing values for both smoking behavior and gender can be
used in the test.
The next tables are the crosstabulation and chi-square test results.
The key result in the Chi-Square Tests table is the Pearson Chi-Square.
Since the p-value is greater than our chosen significance level (α = 0.05), we do not reject the
null hypothesis. Rather, we conclude that there is not enough evidence to suggest an
association between gender and smoking.
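The decision rule can be sketched for this table shape. With three smoking categories and two gender categories, df = (3 - 1)(2 - 1) = 2, and for exactly 2 degrees of freedom the upper-tail chi-square probability has the closed form exp(-x/2). The function names and the statistics passed to `decision` are hypothetical, chosen only to exercise the rule. They are not the values from the SPSS output.

```python
import math

alpha = 0.05
df = (3 - 1) * (2 - 1)  # three smoking categories, two gender categories

def p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-statistic / 2)

# Critical value for df = 2: solve exp(-x / 2) = alpha for x.
critical = -2 * math.log(alpha)

def decision(statistic):
    """Apply the p-value rule at the chosen significance level."""
    return "reject H0" if p_value_df2(statistic) < alpha else "do not reject H0"

# Hypothetical statistics, used only to show both outcomes of the rule.
print(round(critical, 3), decision(3.0), decision(7.0))
```

Comparing the p-value to α and comparing the statistic to the critical value are two ways of stating the same rule. This closed form applies only when df = 2.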
PROBLEM STATEMENT
Let's continue the row and column percentage example from the Crosstabs tutorial, which
described the relationship between the
variables RankUpperUnder (upperclassman/underclassman) and LiveOnCampus (lives on
campus/lives off-campus). Recall that the column percentages of the crosstab appeared to
indicate that upperclassmen were less likely than underclassmen to live on campus:
Suppose that we want to test the association between class rank and living on campus using a
Chi-Square Test of Independence (using α = 0.05).
The clustered bar chart from the Crosstabs procedure can act as a complement to the column
percentages above. Let's look at the chart produced by the Crosstabs procedure for this
example:
The height of each bar represents the total number of observations in that particular
combination of categories. The "clusters" are formed by the row variable (in this case, class
rank). This type of chart emphasizes the differences within the underclassmen and
upperclassmen groups. Here, the difference between the number of students living on campus
and the number living off-campus is much starker within the class rank groups.
1. Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).
2. Select RankUpperUnder as the row variable, and LiveOnCampus as the column
variable.
3. Click Statistics. Check Chi-square, then click Continue.
4. (Optional) Click Cells. Under Counts, check the boxes for Observed and Expected,
and under Residuals, click Unstandardized. Then click Continue.
5. (Optional) Check the box for Display clustered bar charts.
6. Click OK.
SYNTAX
CROSSTABS
/TABLES=RankUpperUnder BY LiveOnCampus
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT EXPECTED RESID
/COUNT ROUND CELL
/BARCHART.
OUTPUT
TABLES
The first table is the Case Processing summary, which tells us the number of valid cases used
for analysis. Only cases with nonmissing values for both class rank and living on campus can
be used in the test.
The next table is the crosstabulation. If you elected to check off the boxes for Observed Count,
Expected Count, and Unstandardized Residuals, you should see the following table:
With the Expected Count values shown, we can confirm that all cells have an expected value
greater than 5.
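The same check can be reproduced outside SPSS. The sketch below takes the four observed counts reported for this example (79, 148, 152, 9), derives the expected counts from the row and column totals, and tests them against the threshold of 5 mentioned above. The variable names are illustrative.

```python
# Observed counts for class rank (rows) by living on campus (columns),
# as reported in this example's crosstabulation.
observed = [[79, 148], [152, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

smallest = min(min(row) for row in expected)
all_above_5 = all(e > 5 for row in expected for e in row)
print([[round(e, 3) for e in row] for row in expected], all_above_5)
```

The smallest expected count here is about 65.1, well above the threshold.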
Computation of the expected cell counts and residuals (observed minus expected) for
the crosstabulation of class rank by living on campus.

                Column 1                           Column 2                          Row total
Underclassman   o11 = 79                           o12 = 148                         row 1 total = 227
(row 1)         e11 = 227 * 231 / 388 = 135.147    e12 = 227 * 157 / 388 = 91.853
                r11 = 79 − 135.147 = −56.147       r12 = 148 − 91.853 = 56.147
Upperclassmen   o21 = 152                          o22 = 9                           row 2 total = 161
(row 2)         e21 = 161 * 231 / 388 = 95.853     e22 = 161 * 157 / 388 = 65.147
                r21 = 152 − 95.853 = 56.147        r22 = 9 − 65.147 = −56.147
Column total    231                                157                               388
These numbers can be plugged into the chi-square test statistic formula:
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(-56.147)^2}{135.147} + \frac{(56.147)^2}{91.853} + \frac{(56.147)^2}{95.853} + \frac{(-56.147)^2}{65.147} = 138.926$$
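As a quick arithmetic check, the sum can be recomputed from the rounded residuals and expected counts in the table. The sketch also derives a p-value for df = (2 - 1)(2 - 1) = 1 using only the standard library. This relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal variable, so its upper-tail probability is erfc(sqrt(x/2)). The variable names are illustrative.

```python
import math

# (residual, expected count) pairs, rounded as reported in the table.
cells = [
    (-56.147, 135.147),
    (56.147, 91.853),
    (56.147, 95.853),
    (-56.147, 65.147),
]

# Each cell contributes (observed - expected)^2 / expected.
statistic = sum(r ** 2 / e for r, e in cells)

# Upper-tail probability for a chi-square variable with 1 df.
# This shortcut is valid only for df = 1.
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 3), p_value < 0.001)
```

Because the inputs are rounded to three decimals, the recomputed statistic can differ from the SPSS value in the last digit.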
Since the p-value is less than our chosen significance level α = 0.05, we can reject the null
hypothesis, and conclude that there is an association between class rank and whether or not
students live on-campus. Based on the results, we can state the following:
There was a significant association between class rank and living on campus ($\chi^2(1)$ =
138.9, p < .001).