Chi Square Test
A chi-square (χ²) statistic is a test that measures how a model compares to actual
observed data. The data used in calculating a chi-square statistic must be
random,
raw,
mutually exclusive,
drawn from independent variables, and
drawn from a large enough sample.
For these tests, degrees of freedom are used to determine whether a certain null
hypothesis can be rejected, based on the total number of variables and samples within
the experiment. As with any statistic, the larger the sample size, the more reliable the
result.
Historical aspects:-
In the 19th century, statistical analytical methods were mainly applied in biological data
analysis and it was customary for researchers such as Sir George Airy and Professor
Merriman to assume that observations followed a normal distribution. Later on, Karl
Pearson criticized the work of those researchers in his 1900 paper.
At the end of the 19th century, Pearson noticed the existence of significant skewness within
some biological observations.
In order to model the observations, regardless of being normal or skewed, Pearson, in a
series of articles published from 1893 to 1916, devised the Pearson distribution, a family of
continuous probability distributions. This family includes the normal distribution and many
skewed distributions.
He also proposed a method of statistical analysis that uses the Pearson distribution
to model the observations.
In 1900, Pearson published a paper on the χ² test which is considered to be one
of the foundations of modern statistics. In this paper, Pearson investigated a
test of goodness of fit to determine how well the model really fits the
observations.
Purpose:-
The Chi-square test is intended to test how likely it is that an
observed distribution is due to chance. It is also called a "goodness of fit" statistic,
because it measures how well the observed distribution of data fits
the distribution that is expected if the variables are independent.
Application area:-
The Chi-square test is used most commonly to compare the
incidence (or proportion) of a characteristic in one group to the incidence (or
proportion) of a characteristic in other group(s).
1. Test for independence of attributes:-
With the help of the χ² test we can find out whether two or more attributes are
associated or not.
2. χ² test as goodness of fit:-
The χ² test for goodness of fit enables us to determine the extent to which a
theoretical probability distribution coincides with the empirical sample
distribution.
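As a minimal sketch of the goodness-of-fit idea, the Python snippet below compares observed counts with the counts that a theoretical distribution predicts, using the statistic Σ (O-E)²/E. The die-roll counts and the function name `chi_square_gof` are hypothetical and serve only as illustration.

```python
def chi_square_gof(observed, expected):
    """Return the goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die; a fair-die model expects 10 per face.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_gof(observed, expected)
degrees_of_freedom = len(observed) - 1  # number of categories minus one

print(f"chi-square = {statistic:.2f}, d.f. = {degrees_of_freedom}")
```

A small statistic means the observed counts sit close to the theoretical ones; whether it is small enough is judged against the critical value for the given degrees of freedom.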
3. For Yates's correction for continuity:-
The distribution of the χ² statistic is continuous, but the data under test are
categorical, which is discrete.
Approximating discrete data with a continuous distribution introduces an error; if the
data form a 2*2 contingency table, we can apply Yates's correction for continuity.
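Yates's correction reduces each absolute difference |O-E| by 0.5 before squaring. The sketch below shows one way this could be computed for a 2*2 table; the counts and the function name `yates_chi_square` are hypothetical, chosen only to illustrate the arithmetic.

```python
def yates_chi_square(table, correction=True):
    """Chi-square for a 2x2 table [[a, b], [c, d]], optionally Yates-corrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if correction:
                # Shrink |O - E| by 0.5, but never below zero.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    return statistic


# Hypothetical 2x2 counts, for illustration only.
table = [[12, 8], [5, 15]]

uncorrected = yates_chi_square(table, correction=False)
corrected = yates_chi_square(table, correction=True)
print(f"uncorrected = {uncorrected:.3f}, corrected = {corrected:.3f}")
```

The corrected value is never larger than the uncorrected one, so the correction makes the test more conservative.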
4. For population variance:-
The χ² test for a population variance is considered a parametric test.
The assumption underlying this χ² test is that the population from which the
sample is drawn is normally distributed.
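For the variance test, the usual statistic is χ² = (n-1)s²/σ0², with n-1 degrees of freedom, where s² is the sample variance and σ0² is the hypothesized population variance. The following is a minimal sketch under that formula; the sample values, the hypothesized variance and the function name `variance_chi_square` are assumptions made for illustration.

```python
import statistics


def variance_chi_square(sample, sigma0_squared):
    """Return ((n - 1) * s^2 / sigma0^2, n - 1) for a one-sample variance test."""
    n = len(sample)
    s_squared = statistics.variance(sample)  # sample variance (n - 1 denominator)
    return (n - 1) * s_squared / sigma0_squared, n - 1


# Hypothetical sample and hypothesized population variance.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sigma0_squared = 4.0

statistic, degrees_of_freedom = variance_chi_square(sample, sigma0_squared)
print(f"chi-square = {statistic:.2f}, d.f. = {degrees_of_freedom}")
```

The statistic is then compared with the χ² critical value for n-1 degrees of freedom, which is valid only under the normality assumption stated above.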
Steps to perform the Chi-Square test:-
1. Define Hypothesis.
2. Build a Contingency table.
3. Find the expected values.
4. Calculate the Chi-Square statistic.
5. Reject or fail to reject the Null Hypothesis.
Consider a data set in which we have to determine why customers are leaving the bank.
Let's perform a Chi-Square test for two variables: Gender of a customer, with values
Male/Female, as the predictor, and Exited, which describes whether a customer is
leaving the bank, with values Yes/No, as the response. In this test we check whether
there is any relationship between Gender and Exited.
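1. Define Hypothesis
Null hypothesis: Assumes that there is no association between the two variables.
Alternative hypothesis: Assumes that there is an association between the two variables.
The critical value used to decide between them depends on
1. the level of significance (alpha),
2. the degrees of freedom.
If the observed chi-square test statistic is greater than the critical value, the null
hypothesis can be rejected.
If the observed chi-square test statistic is less than the critical value, the null
hypothesis cannot be rejected.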
2. Contingency table
A table showing the distribution of one variable in rows and another in columns. It is
used to study the relation between two variables.
Contingency table for observed values
In the above table we have recorded all the observed values; the next steps are to
find the expected values, get the Chi-Square value and check for a relationship.
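3. Find the expected values
The expected values are based on the null hypothesis that the two variables are
independent. If A and B are two independent events, then
P(A ∩ B) = P(A) * P(B)
so the expected count in a cell is N * (RT/N) * (CT/N).
Formula:
E = RT*CT/N
where RT is the row total, CT is the column total and N is the total number of
observations.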
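Let's calculate the expected value for the first cell, that is, those who are Male and
have Exited from the bank, and then for the remaining cells (rounded to whole numbers):
E11 = 216*82/400 = 44
E12 = 216*318/400 = 172
E21 = 184*82/400 = 38
E22 = 184*318/400 = 146
We get the following results.
Expected values
            Exited: Yes   Exited: No   Total
Male             44           172        216
Female           38           146        184
Total            82           318        400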
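The expected-value arithmetic can be reproduced with a few lines of Python. This is a sketch that uses only the marginal totals quoted in the calculation (216 and 184 for the rows, 82 and 318 for the columns); the labels "Female" and "Exited: No" for the second row and column are inferred from the variable values, and the function name `expected_counts` is hypothetical.

```python
# Marginal totals taken from the worked example in the text.
row_totals = {"Male": 216, "Female": 184}
col_totals = {"Exited: Yes": 82, "Exited: No": 318}
n = sum(row_totals.values())  # total number of observations (400)


def expected_counts(row_totals, col_totals, n):
    """Expected count per cell under independence: E = RT * CT / N."""
    return {
        (row, col): rt * ct / n
        for row, rt in row_totals.items()
        for col, ct in col_totals.items()
    }


expected = expected_counts(row_totals, col_totals, n)
for (row, col), value in expected.items():
    print(f"{row:6s} {col:11s} E = {value:.2f}")
```

The unrounded values are 44.28, 171.72, 37.72 and 146.28, and they add up to N = 400.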
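4. Calculate the Chi-Square statistic
Summarize the observed values and the calculated expected values in one table and
determine the Chi-Square value with the Chi-Square statistic formula
χ² = Σ (O-E)²/E
where O is the observed value and E is the expected value of a cell.
Here the Chi-Square value is calculated as 2.22.
The degrees of freedom are
d.f. = (r-1) * (c-1)
where r is the number of rows and c is the number of columns of the contingency table.
For a 2*2 table, d.f. = (2-1) * (2-1) = 1.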
5. Reject or fail to reject the Null Hypothesis
With 95% confidence, that is alpha = 0.05, we check whether the calculated Chi-Square
value falls in the acceptance or the rejection region.
The critical Chi-Square value can be read from the Chi-Square table.
With degrees of freedom = 1 (calculated from the contingency table) and alpha = 0.05, the
critical Chi-Square value is 3.84.
In the above figure, we can see that Chi-Square ranges from 0 to infinity, while alpha
ranges from 0 to 1 in the opposite direction. We reject the null hypothesis if the
Chi-Square value falls in the rejection region, that is, beyond the critical value.
Here we fail to reject the null hypothesis, since the calculated Chi-Square
value is less than the critical Chi-Square value:
2.22 < 3.84
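The decision can also be checked numerically. For 1 degree of freedom the right-tail probability of the chi-square distribution has the closed form erfc(sqrt(x/2)), so the standard library is enough. The sketch below uses the values from the text (2.22, 3.84, alpha = 0.05); the function name `chi2_right_tail_df1` is hypothetical, and the formula applies to 1 degree of freedom only.

```python
import math


def chi2_right_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


statistic = 2.22       # calculated Chi-Square value from the text
critical_value = 3.84  # table value for d.f. = 1, alpha = 0.05
alpha = 0.05

p_value = chi2_right_tail_df1(statistic)
reject_null = statistic > critical_value

print(f"tail probability at the critical value: {chi2_right_tail_df1(critical_value):.4f}")
print(f"p-value of the statistic: {p_value:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.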
Limitations
Worked example:-
E11 = RT*CT/N = 120*500/2000 = 30
Calculation of χ²
O       E       (O-E)   (O-E)²  (O-E)²/E
20      30      -10     100     3.33
100     90      +10     100     1.11
480     470     +10     100     0.21
1400    1410    -10     100     0.07
χ² = Σ (O-E)²/E = 4.72
d.f. = (C-1)(R-1) = (2-1)(2-1) = 1
χ² at the 0.05 level with 1 d.f. = 3.84
Since 4.72 > 3.84, H0 is rejected.
Therefore, the conclusion is that quinine is useful in malaria.
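The whole calculation for this example can be sketched in Python as a check on the hand arithmetic. The observed counts are the four O values from the table, arranged so that the first row total is 120 and the first column total is 500, as in the E11 calculation; the row and column labels are not given in the text, so none are assumed here, and the function name `chi_square_independence` is hypothetical.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E = RT * CT / N
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Observed counts from the worked example (row totals 120 and 1880,
# column totals 500 and 1500, N = 2000).
observed = [[20, 100], [480, 1400]]
critical_value = 3.84  # table value for d.f. = 1, alpha = 0.05

statistic, degrees_of_freedom = chi_square_independence(observed)
print(f"chi-square = {statistic:.2f}, d.f. = {degrees_of_freedom}")
print("reject H0" if statistic > critical_value else "fail to reject H0")
```

Without intermediate rounding the statistic is about 4.73; the hand calculation gives 4.72 because each term is rounded to two decimals before summing. The conclusion is the same either way.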