Week 6 Lecture
• Test statistic measures the discrepancy between the observed and expected
frequencies
• Form a hypothesis to test for which we can generate the expected frequencies of the
various categories
• Then
• Compare the test statistic computed from the data with the relevant percentile of the
Chi-Squared distribution
• OR
• Compute the P-value = probability of a test statistic at least as large as the one
observed, given that the test statistic has a Chi-squared distribution
• to decide whether the evidence from the sample is consistent with the hypothesis
• Then compare the numerical value of this test statistic T with the critical value (CV)
from the CHISQ distribution
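The computation of the test statistic can be sketched in Python. This is a minimal illustration, assuming the usual form T = sum of (O − E)² / E over the categories. The observed counts are hypothetical (they are not the lecture's table; they were chosen so that T equals the 5.08 used later in these notes), and the function name chi_square_statistic is made up for the example.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for 300 rolls of a six-sided die (NOT the lecture's data).
observed = [61, 41, 44, 54, 50, 50]

# Null hypothesis: the die is fair, so each face is expected 300 / 6 = 50 times.
n_rolls = sum(observed)
expected = [n_rolls / len(observed)] * len(observed)

T = chi_square_statistic(observed, expected)
df = len(observed) - 1  # number of categories minus one

print(f"T = {T:.2f} with {df} degrees of freedom")
```

The expected frequencies come entirely from the hypothesis being tested, which is why the hypothesis has to be one for which they can be generated.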
OFFICE I FACULTY I DEPARTMENT
Chi Square Goodness of Fit Test
• The hypothesis used to generate the expected frequencies is called the “null
hypothesis”
• The probability value α used to compute the critical value is called the “level of
significance”
• The critical value is computed as the 100 × (1 − α)th percentile of the CHISQ
distribution with k = N − 1 df, where N is the number of categories
• If T > CV, a test statistic this large would occur with low probability (less than α,
e.g. 5%) if the hypothesis used to generate the expected frequencies were true. We
take this as evidence against that hypothesis and reject it.
• Out of 300 rolls of the die the actual / observed frequencies are
• P-value
• We can compute Pr(chisq(df = 5) ≤ 5.08) using the Excel formula =CHISQ.DIST(5.08, 5, 1).
The result is 59.38%.
• We can compute Pr(chisq(df = 5) > 5.08) using the Excel formula =CHISQ.DIST.RT(5.08, 5).
The result is 40.62%.
• OR
• We can compute the probability that the test statistic is at this level or higher
• P-Value = Pr(CHISQ > TS) = Pr(CHISQ > 5.08) = 40.62%
• The P-value is much higher than the “level of significance” α = 5%
• Hence we do not find evidence against the null hypothesis, as the P-value is not lower
than 5%
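The two decision rules (critical value and P-value) can be checked against each other in Python without Excel. This is a sketch using only the standard library: chi2_sf and chi2_critical_value are hypothetical helper names, the upper-tail probability uses the closed-form series that holds for whole-number df, and the critical value is found by bisection rather than by a library percentile function.

```python
import math


def chi2_sf(x, df):
    """Pr(chi-square with integer df > x), from the closed-form series."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        terms = (half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    # Odd df: erfc(sqrt(x/2))
    #         + exp(-x/2) * sum_{j=1}^{(df-1)/2} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    terms = (
        half ** (j - 0.5) / math.gamma(j + 0.5)
        for j in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


def chi2_critical_value(alpha, df, tol=1e-10):
    """Value c with Pr(chi-square > c) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:  # widen the bracket until it contains c
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Values from the die example: T = 5.08, df = 5, alpha = 5%.
T, df, alpha = 5.08, 5, 0.05
p_value = chi2_sf(T, df)              # same quantity as =CHISQ.DIST.RT(5.08, 5)
cv = chi2_critical_value(alpha, df)   # 95th percentile of chi-square with 5 df
reject = T > cv

print(f"P-value = {p_value:.4f}, CV = {cv:.4f}, reject H0: {reject}")
```

Both rules always agree: T exceeds the critical value exactly when the P-value falls below α.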
• Generalise the previous Chi-square technique to the case where two variables
are involved
• CHISQ.DIST.RT(TS, DF)
• p-value = 5.78%
• If α = 5%, there is insufficient evidence to conclude that sales performance depends on
grad status
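A two-variable test of this kind can be sketched as follows. The table below is hypothetical (it is not the lecture's sales data, so its P-value differs from the 5.78% above). The sketch assumes the standard rules that each expected count is row total × column total / grand total and that df = (rows − 1) × (columns − 1).

```python
import math

# Hypothetical 2 x 3 table (NOT the lecture's data):
# rows = graduate / non-graduate, columns = low / medium / high sales.
observed = [
    [20, 30, 50],
    [30, 40, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Same statistic as in the goodness-of-fit test, summed over every cell.
T = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# For df = 2 ONLY, Pr(chi-square > T) has the closed form exp(-T / 2).
assert df == 2
p_value = math.exp(-T / 2)

print(f"T = {T:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The closed form on the last step is specific to df = 2; for other table sizes the Excel function CHISQ.DIST.RT(TS, DF) or an equivalent is needed.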
• When some expected frequencies are low, the chi-square formula tends to overestimate
the statistical significance of the data. This can lead to a conclusion where
H0 is mistakenly rejected.
• Yates’ correction factor is mainly used when at least one cell of the contingency table has
an expected frequency of less than 5.
• In situations with large sample sizes, using the correction will have little effect.
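The effect of the correction can be seen on a small table. This sketch assumes the usual form of Yates' correction for a 2 × 2 table, in which each |O − E| is reduced by 0.5 before squaring. The table is hypothetical, the function names are made up for the example, and the floor at zero is a common safeguard rather than something stated in the lecture.

```python
import math


def chi_square_2x2(observed, yates=False):
    """Chi-square statistic for a 2 x 2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            diff = abs(o - e)
            if yates:
                # Shrink each |O - E| by 0.5, never below zero (assumed safeguard).
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat


def p_value_df1(stat):
    """Pr(chi-square with 1 df > stat), via the normal-tail identity."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical small table; one expected count is 10 * 10 / 24, below 5.
observed = [
    [2, 8],
    [8, 6],
]

plain = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)

print(f"uncorrected: T = {plain:.3f}, p = {p_value_df1(plain):.4f}")
print(f"corrected:   T = {corrected:.3f}, p = {p_value_df1(corrected):.4f}")
```

The corrected statistic is smaller and its P-value larger, so the correction makes rejection of H0 harder. With large counts the 0.5 adjustment is negligible relative to |O − E|.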
• Sampling costs money and time. We sample instead of measuring the entire population
mainly because measuring everything would cost too much and take too long.
• Question: what value to use for s, the estimated standard deviation of the
data?
Sample size for means
• Define D as the largest distance from the true value of the mean that we are willing to accept
• So
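The slide's own formula is not reproduced in these notes. One common form of the calculation, assumed here, is n = (z × s / D)², rounded up, where z is the standard normal value for the chosen confidence level. The function name, the 95% default and the input values are all illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(s, D, confidence=0.95):
    """Smallest n with z * s / sqrt(n) <= D (assumed formula, see lead-in)."""
    # Two-sided z value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * s / D) ** 2)


# Hypothetical inputs: estimated standard deviation s = 10, acceptable distance D = 2.
n = sample_size_for_mean(s=10.0, D=2.0)
print(f"required sample size: {n}")
```

Halving D roughly quadruples the required sample size, which is why the choice of s and D matters for cost.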