Chi-Square Test

Last Updated : 02 Aug, 2025

The chi-squared (χ²) test is a statistical test for categorical data. It determines whether observed frequencies differ significantly from the frequencies expected under a given hypothesis, which makes it a common tool when testing hypotheses to extract useful information from data.

Applied to two categorical variables, the test helps determine whether the variables are associated with each other or independent.

This test is widely used in market research, healthcare, social sciences, and more to analyze categorical relationships.


For example, Entity 1: People’s favorite colors and Entity 2: Their preference for ice cream.

  • Null Hypothesis (H₀): Favorite color and ice cream preference are independent (no relationship).
  • Alternative Hypothesis (H₁): They are dependent (a relationship exists).

By comparing observed survey data with the frequencies expected if no relationship existed, the chi-square test calculates a test statistic (χ²). If this value is large enough, we reject H₀ and conclude that favorite color and ice cream preference are associated. The test does not say which variable influences the other.

Formula For Chi-Square Test

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

Symbols are broken down as follows:

  • O_i: Observed frequency in category (or cell) i
  • E_i: Expected frequency in category (or cell) i under the null hypothesis
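
The formula translates directly into a few lines of Python. The sketch below is only illustrative: the function name `chi_square_statistic` and the counts are assumptions made up for this example, not data from the article.

```python
def chi_square_statistic(observed, expected):
    """Sum (O_i - E_i)^2 / E_i over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts, made up for illustration only
observed = [18, 22, 20]
expected = [20, 20, 20]

stat = chi_square_statistic(observed, expected)
print(round(stat, 3))  # (4 + 4 + 0) / 20 = 0.4
```

Each term grows with the squared gap between an observed and an expected count, so a larger statistic means the data sit further from what the hypothesis predicts.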

Categorical Variables

Categorical variables classify data into distinct, non-numerical groups (e.g., colors, fruit types).

Key Characteristics:

  1. Distinct Groups: No overlap (e.g., hair color: blonde, brunette).
  2. Non-Numerical: Values are labels rather than quantities (e.g., "apple" ≠ "orange" numerically); the chi-square test ignores any ordering among categories.
  3. Limited Options: Fixed categories (e.g., traffic lights: red, yellow, green).

Example:

"Do you prefer tea, coffee, or juice?" → Categories: tea/coffee/juice.

Steps for Chi-Square Test

The steps below use a running example: testing whether a person's sex is associated with the type of ice cream they choose.

Step 1: Define Hypothesis

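  • Null Hypothesis (H₀): Sex and ice cream flavor preference are independent, so the observed frequencies match those expected under independence.
  • Alternative Hypothesis (H₁): Sex and flavor preference are associated, so the observed frequencies do not match those expected under independence.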

Step 2: Gather and Organize Data

Gather Information about the Two Category Variables:

Before performing a chi-square test, you need data on the two categorical variables you wish to study.

  • Collect each respondent's sex (male or female) and favorite flavor (e.g., chocolate, vanilla, strawberry).
  • Once this information is collected, the counts can be arranged in a contingency table.

Suppose we suspect that men prefer vanilla while women prefer chocolate. To test for an association, we record how many respondents of each sex chose each flavor, not only the vanilla and chocolate counts.

Here's an example of what a contingency table might look like:


|        | Chocolate | Vanilla | Strawberry |
|--------|-----------|---------|------------|
| Male   | 20        | 15      | 10         |
| Female | 25        | 20      | 30         |
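
A contingency table like this is usually tallied from raw survey responses. The sketch below shows one way to do that with the standard library; the six responses are hypothetical and deliberately much smaller than the survey in the example.

```python
from collections import Counter

# Hypothetical survey responses (sex, favorite flavor), made up for illustration
responses = [
    ("Male", "Chocolate"), ("Female", "Vanilla"), ("Female", "Strawberry"),
    ("Male", "Vanilla"), ("Female", "Chocolate"), ("Male", "Chocolate"),
]

counts = Counter(responses)  # maps (sex, flavor) -> number of respondents
rows = ["Male", "Female"]
cols = ["Chocolate", "Vanilla", "Strawberry"]

# One row per sex, one column per flavor; combinations never seen count as 0
table = [[counts[(r, c)] for c in cols] for r in rows]
print(table)  # [[2, 1, 0], [1, 1, 1]]
```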

Step 3: Calculate Expected Frequencies

  • Expected Frequency: For any cell, the expected frequency is the number of occurrences we would expect if the two variables were independent.
  • Expected Frequency Calculation: Multiply the cell's row total by its column total, then divide by the total number of observations in the table.

The observed frequencies are those in the table above. The row totals are 45 (male) and 75 (female), the column totals are 45 (chocolate), 35 (vanilla) and 40 (strawberry), and the grand total is 120.

E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}

  • Male and Chocolate: \frac{45 \times 45}{120} = 16.875
  • Male and Vanilla: \frac{45 \times 35}{120} = 13.125

Summarizing,

  • Male: Chocolate: 16.875, Vanilla: 13.125, Strawberry: 15.0
  • Female: Chocolate: 28.125, Vanilla: 21.875, Strawberry: 25.0
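
The expected frequencies for every cell can be computed in one pass from the row and column totals. This is a minimal sketch using the ice cream table; the helper name `expected_frequencies` is an assumption introduced for the example.

```python
def expected_frequencies(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts: rows are Male, Female; columns are Chocolate, Vanilla, Strawberry
observed = [[20, 15, 10], [25, 20, 30]]

expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, row)
# Male [16.875, 13.125, 15.0]
# Female [28.125, 21.875, 25.0]
```

Each row of expected counts sums to the same row total as the observed data, which is a quick sanity check on the calculation.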

Step 4: Perform Chi-Square Test

Use Chi-Square Formula:

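χ² = Σ (Oi - Ei)² / Ei

The sum runs over all six cells of the table:

\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(20 - 16.875)^2}{16.875} + \frac{(15 - 13.125)^2}{13.125} + \frac{(10 - 15)^2}{15} + \frac{(25 - 28.125)^2}{28.125} + \frac{(20 - 21.875)^2}{21.875} + \frac{(30 - 25)^2}{25} \approx 0.579 + 0.268 + 1.667 + 0.347 + 0.161 + 1.000 \approx 4.02

Step 5: Determine Degrees of Freedom (df)

df = (number of rows - 1) × (number of columns - 1)

df = (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2

Step 6: Find p-value

  • Compare χ² to the Chi-Square Distribution Table for the given df.

For χ² ≈ 4.02 with df = 2, the critical value at α = 0.05 is 5.991. Since 4.02 < 5.991, p > 0.05 (p ≈ 0.134).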
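
Steps 4 to 6 can be checked numerically. The sketch below sums the contribution of all six cells, derives the degrees of freedom, and computes the p-value. It relies on the fact that for df = 2 the upper-tail probability of the chi-square distribution is exactly exp(-χ²/2); that shortcut is an assumption of this example and does not hold for other df.

```python
import math

# Observed counts: rows are Male, Female; columns are Chocolate, Vanilla, Strawberry
observed = [[20, 15, 10], [25, 20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 4: add (O - E)^2 / E for every cell
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

# Step 5: degrees of freedom
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 6: p-value; exp(-x / 2) is the exact upper tail only when df == 2
p_value = math.exp(-chi2 / 2)

print(round(chi2, 2), df, round(p_value, 3))  # 4.02 2 0.134
```

For general tables, a statistics library is the usual choice; SciPy, for instance, provides `scipy.stats.chi2_contingency` for this test.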

Step 7: Interpret Results

  • If the p-value is less than the chosen significance level α (e.g., 0.05), we reject the null hypothesis. This means there is statistically significant evidence that the categorical variables are associated.
  • If the p-value is above α, we cannot reject the null hypothesis; there is insufficient evidence of a relationship between the variables.

In this example p > 0.05, so there is no significant evidence that ice cream flavor preference depends on sex, and the claim that men prefer vanilla or women prefer chocolate is not supported.

Addressing Assumptions and Considerations

  • Chi-square tests assume that the observations are independent of one another; each respondent contributes to exactly one cell.
  • Each cell should have an expected frequency of at least five for the chi-square approximation to be reliable. If any expected frequency is below five, consider Fisher’s exact test as an alternative.
  • Chi-square tests identify an association between variables; they do not establish a causal relationship.
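
The expected-frequency rule of thumb is easy to check before running the test. The sketch below flags cells whose expected count falls under a threshold; the helper name `small_expected_cells` and the `sparse` table are hypothetical, introduced only for this example.

```python
def small_expected_cells(table, minimum=5):
    """Return (row, column, expected) for each cell whose expected count is below `minimum`."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / grand_total
            if expected < minimum:
                flagged.append((i, j, expected))
    return flagged


# Ice cream table from the steps above: the smallest expected count is 13.125
ice_cream = [[20, 15, 10], [25, 20, 30]]
print(small_expected_cells(ice_cream))  # []

# Hypothetical sparse table, made up for illustration
sparse = [[1, 2], [3, 40]]
print(len(small_expected_cells(sparse)))  # 3
```

An empty result means the rule of thumb is satisfied; a non-empty one suggests merging categories or switching to an exact test.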

Goodness-Of-Fit

A goodness-of-fit test checks whether observed data match a hypothesized distribution. For example, testing whether a die is fair by comparing the observed count of each face with the equal counts expected from a fair die.

Key Aspects:

  1. Purpose: Validate if data fits an expected distribution.
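  2. Data Types: Goodness-of-fit tests exist for both categorical (e.g., survey responses) and continuous (e.g., heights) data; the chi-square version requires continuous data to be grouped into bins first.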
  3. Applications: Compare observed vs. expected frequencies (e.g., Chi-Square test) and assess if data follows a specific distribution (e.g., normal distribution).
  4. Benefits: Identifies model-data mismatch.
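
As a rough illustration of a chi-square goodness-of-fit test, the sketch below derives expected counts from hypothesized category proportions and compares them with observed counts. The function name `goodness_of_fit`, the counts, and the proportions are all hypothetical. It also assumes that no distribution parameters were estimated from the data, so df is the number of categories minus one.

```python
import math


def goodness_of_fit(observed, proportions):
    """Return the chi-square statistic and df for observed counts vs. hypothesized proportions."""
    if len(observed) != len(proportions):
        raise ValueError("observed and proportions must have the same length")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    n = sum(observed)
    expected = [n * p for p in proportions]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical data: 100 responses across three categories
observed = [48, 35, 17]
proportions = [0.5, 0.3, 0.2]  # hypothesized shares, so expected counts are 50, 30, 20

stat, df = goodness_of_fit(observed, proportions)
p_value = math.exp(-stat / 2)  # exact upper tail only because df == 2 here
print(round(stat, 3), df, round(p_value, 3))
```

A large p-value, as here, means the observed counts are compatible with the hypothesized proportions.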

Applications of Chi-Square Test in Computer Science

A/B Testing & Feature Evaluation

  • Compare user engagement (e.g., clicks, conversions) between two website versions (A vs. B).
  • The chi-square test checks whether observed outcomes (e.g., "Click" vs. "No Click") differ significantly between groups.
  • Example: Observed: Version A: 120 clicks / 1,000 views; Version B: 150 clicks / 1,000 views. Chi-Square: Checks if the difference is statistically significant (not due to chance).
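
The A/B example can be worked through as a 2×2 contingency table of clicks and non-clicks. The sketch below uses the plain chi-square statistic without a continuity correction, which is an assumption of this example. For df = 1 the upper-tail probability equals erfc(√(χ²/2)).

```python
import math

# Observed counts from the example: [clicks, no clicks] per version
observed = [
    [120, 1000 - 120],  # Version A
    [150, 1000 - 150],  # Version B
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

# A 2x2 table has df = (2 - 1) * (2 - 1) = 1
p_value = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), round(p_value, 4))  # chi2 is about 3.85, p is just under 0.05
```

The result is borderline. Some tools apply Yates' continuity correction to 2×2 tables by default, which gives a smaller statistic, so a difference this close to the threshold should be interpreted with care.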

Machine Learning (Feature Selection)

  • Identify categorical features associated with the target variable.
  • Test for independence between a feature and the target (e.g., "Browser Type" vs. "Purchase Decision") using the chi-square test.
  • Example: χ² p-value < 0.05 → "Browser Type" is significantly associated with purchases, so it is a candidate feature.

Database Query Optimization

  • Assess if data is evenly distributed across partitions.
  • Chi-square is used to test if actual row counts per partition match the expected uniform distribution.
  • Example: A statistically significant χ² indicates an uneven distribution, which suggests a poor sharding strategy.

Natural Language Processing (NLP)

  • Evaluate word frequency distributions in texts.
  • Compare observed word counts (e.g., "error" in logs) to the expected Poisson distribution.
  • Example: Detects overused terms in spam emails (χ² highlights deviations from normal usage).

Solved Examples on Chi-Square Test

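Example 1: A study investigates the relationship between eye color (blue, brown, green) and hair color (blonde, brunette, redhead). The expected frequencies under independence, computed from the row and column totals, are:

| Eye Color | Blonde | Brunette | Redhead | Total |
|-----------|--------|----------|---------|-------|
| Blue      | 35     | 52.5     | 12.5    | 100   |
| Brown     | 28.1   | 42.1     | 9.8     | 80    |
| Green     | 6.9    | 10.4     | 2.7     | 20    |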

Solution:

Calculate the chi-square contribution of each cell in the contingency table using the formula

(Oi - Ei)² / Ei

For instance, consider the cell whose expected frequency is 28.1 (brown eyes and blonde hair), with an observed count of 15:

(15 - 28.1)² / 28.1 ≈ 6.11

To obtain the total chi-square statistic, compute this contribution for each of the nine cells in the table and sum them.

Degrees of Freedom (df):

df = (number of rows - 1) × (number of columns - 1)

df = (3 - 1) × (3 - 1)

df = 2 × 2 = 4

Finding p-value:

Use a chi-square distribution table with the appropriate degrees of freedom to estimate the p-value for your chi-square statistic (χ²). Since most tables list only selected values, find the closest tabulated value and read off its corresponding p-value.

For illustration, if your chi-square value were 20.5, the nearest tabulated value for df = 4 would be 14.86, which corresponds to p = 0.005. Because 20.5 is larger than 14.86, the p-value is below 0.005.
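
Instead of a table lookup, the p-value can be computed directly. For even degrees of freedom the upper-tail probability has a closed form, exp(-x/2) · Σ (x/2)^k / k! for k from 0 to df/2 − 1. The sketch below applies it to the illustrative value 20.5 with df = 4; the function name `chi2_sf_even_df` is an assumption, and the formula is valid for even df only.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(20.5, 4)
print(p_value < 0.005)     # True
print(round(p_value, 6))   # about 0.000398
```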

Interpreting Results:

  • Select a significance level (α = 0.05 is common). This limits the probability of rejecting the null hypothesis when it is actually true (Type I error).
  • Compare the p-value with α.
  • If the p-value is less than the significance level (p-value < 0.05), we reject the null hypothesis: there is sufficient evidence that hair color and eye color are related.
  • If the p-value is greater than the significance level (p-value > 0.05), we cannot reject the null hypothesis: based on the data at hand, we cannot say that there is a statistically significant association between eye and hair color.

Example 2: A coin is flipped 100 times. Under the null hypothesis the coin is fair, with an equal chance of heads and tails. The observed results are 55 heads and 45 tails.

Solution:

A coin has two sides, heads and tails. If the coin is fair, each flip has a 50/50 chance of either outcome, so in 100 flips we expect 50 heads and 50 tails.

The expected values are then compared with the observed counts:

χ² = (55 - 50)² / 50 + (45 - 50)² / 50 = 0.5 + 0.5 = 1.0

With two categories, df = 2 - 1 = 1, and the critical value at α = 0.05 is 3.841. Since 1.0 < 3.841, p > 0.05 and we cannot reject the null hypothesis: a 55/45 split is consistent with chance variation from a fair coin. Results that differ from the expected counts by more than chance alone would explain would instead indicate a biased coin.
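
The coin example is a goodness-of-fit test with two categories, so it is small enough to verify by hand or in code. This sketch computes the statistic and the exact p-value, using the df = 1 identity that the upper-tail probability equals erfc(√(χ²/2)).

```python
import math

observed = [55, 45]                # heads, tails
n = sum(observed)
expected = [n * 0.5, n * 0.5]      # a fair coin gives 50 heads and 50 tails

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1             # two categories, so df = 1

p_value = math.erfc(math.sqrt(chi2 / 2))
print(chi2, df, round(p_value, 3))  # 1.0 1 0.317
```

A p-value of about 0.32 means a split at least this uneven would occur in roughly one in three runs of 100 fair flips.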


Practice Problems on Chi-Square Test

Q1. Market Research on Beverages

A company conducts a survey to determine whether there's a relationship between age groups and preferred beverages. The data collected is as follows:

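| Age Group | Coffee | Tea | Soft Drinks | Water |
|-----------|--------|-----|-------------|-------|
| 18-25     | 30     | 20  | 25          | 15    |
| 26-35     | 25     | 30  | 20          | 25    |
| 36-45     | 20     | 25  | 30          | 25    |
| 46-55     | 15     | 20  | 25          | 40    |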

Use a chi-square test to determine if there is an association between age groups and preferred beverages.

Q2. Student Performance

A teacher wants to find out if there is a relationship between study habits and grades. The data collected is as follows:

| Study Habits | A  | B  | C  | D  | F  |
|--------------|----|----|----|----|----|
| Regular      | 15 | 20 | 25 | 10 | 5  |
| Occasional   | 10 | 15 | 20 | 15 | 10 |
| Rare         | 5  | 10 | 15 | 20 | 25 |

Perform a chi-square test to determine if study habits and grades are associated.

Q3. Gender and Major

A university wants to see if there is an association between gender and chosen major. The data collected is:

| Major       | Male | Female |
|-------------|------|--------|
| Engineering | 60   | 30     |
| Business    | 40   | 50     |
| Arts        | 20   | 40     |
| Sciences    | 30   | 30     |

Conduct a chi-square test to examine if gender and chosen major are related.

Q4. Voting Preferences

A political analyst wants to know if there is a relationship between gender and voting preference. The data is:

| Preference  | Male | Female |
|-------------|------|--------|
| Candidate A | 80   | 90     |
| Candidate B | 70   | 60     |
| Undecided   | 50   | 40     |

Test the hypothesis that gender and voting preference are independent.

Q5. Diet and Exercise

A health study examines the relationship between diet type and exercise frequency. The data is:

| Exercise Frequency | Vegan | Vegetarian | Omnivore |
|--------------------|-------|------------|----------|
| Regular            | 40    | 30         | 50       |
| Occasionally       | 30    | 40         | 30       |
| Never              | 20    | 30         | 20       |

Use a chi-square test to determine if diet type and exercise frequency are associated.

Q6. Customer Preferences

A retailer wants to determine if there is an association between customer age and preferred store section. The data is:

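| Age Group | Electronics | Clothing | Groceries |
|-----------|-------------|----------|-----------|
| 18-25     | 50          | 30       | 20        |
| 26-35     | 40          | 40       | 20        |
| 36-45     | 30          | 50       | 20        |
| 46-55     | 20          | 40       | 40        |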

Perform a chi-square test to investigate the association between age group and preferred store section.

Q7. Employment Status and Education Level

A survey is conducted to find out if there is a relationship between employment status and education level. The data is:

| Education Level | Employed | Unemployed |
|-----------------|----------|------------|
| High School     | 30       | 20         |
| Bachelor's      | 40       | 30         |
| Master's        | 20       | 10         |
| PhD             | 10       | 5          |

Test if there is an association between education level and employment status.

Q8. Favorite Sport

A researcher wants to find out if there is a relationship between gender and favorite sport. The data is:

| Sport      | Male | Female |
|------------|------|--------|
| Football   | 50   | 30     |
| Basketball | 40   | 30     |
| Tennis     | 30   | 40     |
| Swimming   | 20   | 50     |

Conduct a chi-square test to determine if gender and favorite sport are associated.

Q9. Internet Usage and Device Type

A study investigates the relationship between internet usage frequency and preferred device type. The data is:

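| Usage Frequency | Smartphone | Tablet | Laptop | Desktop |
|-----------------|------------|--------|--------|---------|
| Daily           | 50         | 20     | 60     | 30      |
| Weekly          | 30         | 30     | 20     | 20      |
| Monthly         | 20         | 30     | 10     | 20      |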

Use a chi-square test to examine if internet usage frequency and preferred device type are related.

Q10. Smoking Habits

A health survey examines the relationship between smoking habits and exercise frequency. The data is:

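| Smoking Habits | Regular Exercise | Occasional Exercise | No Exercise |
|----------------|------------------|---------------------|-------------|
| Smoker         | 30               | 40                  | 30          |
| Non-Smoker     | 50               | 30                  | 20          |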

