
Week 6 Lecture

The document covers introductory categorical analysis, focusing on categorical variables, their analysis, and the Chi-Square Goodness of Fit Test. It explains hypothesis testing, contingency tables for independence, and sample size selection for accuracy in data analysis. Additionally, it discusses Yates' correction factor for low expected frequencies and methods for determining sample sizes for means and proportions.

Uploaded by

Aaliyan Bandealy
Copyright
© All Rights Reserved
BUSA6004 Managing Data

LECTURE WEEK 6 – INTRODUCTORY CATEGORICAL ANALYSIS


What is a categorical variable?
• Individuals are classified into categories that are descriptive, not numerical
• E.g

OFFICE I FACULTY I DEPARTMENT 2


What sort of analysis of categorical variables?
• E.g.
• Do people of different age groups differ in their view of human cloning?
• Is there a significantly higher proportion of male wine-drinkers than female wine-drinkers?
• Is there a relationship between age and view on human cloning?
• Is there a relationship between gender and wine drinking?


Presentation of single variable categorical data

• E.g. Outcomes of 60 rolls of a six-sided die


Contingency table
• Two categorical variables
• E.g. Netflix survey


Chi Square Goodness of Fit Test

• Analysis of single variable problem

• This test involves comparing the observed frequencies of occurrence of the categories and the expected frequencies

• Test statistic measures the discrepancy between the observed and expected frequencies

• Test statistic follows the “Chi-square distribution with k degrees of freedom”


Hypothesis test

• Form a hypothesis to test, for which we can generate the expected frequencies of the various categories
• Then
• Compare the test statistic computed from the data with the relevant percentile of the Chi-squared distribution
• OR
• Compute the P-value = probability that the test statistic takes a value at least as large as the one observed, given that the test statistic has a Chi-squared distribution
• to decide whether the evidence from the sample is consistent with the hypothesis


Chi Square Goodness of Fit Test
• Test Statistic

• The test statistic follows a Chi-square distribution with K = N – 1 degrees of freedom, where N is the number of categories. The critical value CV is defined so that Pr(T > CV) = α

• Then compare the numerical value of this test statistic T with the CV from the CHISQ distribution
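
The test statistic is usually written in Pearson's form. A sketch of that standard formula, assuming O_i denotes the observed frequency of category i and E_i its expected frequency under the hypothesis (this notation is assumed here, not taken from the slide):

```latex
T = \sum_{i=1}^{N} \frac{(O_i - E_i)^2}{E_i},
\qquad
T \sim \chi^2_{K}, \quad K = N - 1,
\qquad
\Pr(T > CV) = \alpha
```
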
Chi Square Goodness of Fit Test

• The hypothesis used to generate the expected frequencies is called the “null hypothesis”
• The probability value α used to compute the critical value is called the “level of significance”
• The critical value is computed as the 100 × (1 − α)th percentile of the CHISQ distribution with K = N − 1 df

• If T > CV, we take this as evidence against the hypothesis used to generate the expected frequencies: if that hypothesis were true, the probability of getting a test statistic this large would be low, i.e. less than α (e.g. 5%). In that case, reject the hypothesis


Example
• N = number of categories = 6
• K = degrees of freedom = N − 1 = 5
• Null hypothesis: the different categories are all equally likely and so have the same expected frequency

• Out of 300 rolls of the die the expected frequencies are 300/6 = 50 for each face

• Out of 300 rolls of the die the actual / observed frequencies are


Example
• Calculation of the test statistic
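
A minimal Python sketch of this calculation follows. The observed counts are hypothetical: they are illustrative values chosen so that the statistic comes out at the 5.08 used in the rest of this example, and should not be read as the lecture's actual data. The function name chi_square_statistic is likewise just a label for this sketch.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical observed counts for faces 1..6 over 300 rolls (NOT the lecture's data).
observed = [60, 41, 44, 56, 49, 50]
# Under the null hypothesis every face is equally likely: 300 / 6 = 50 each.
expected = [sum(observed) / len(observed)] * len(observed)

ts = chi_square_statistic(observed, expected)
df = len(observed) - 1
print(f"TS = {ts:.2f}, df = {df}")
```
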


Example
• The 95th percentile of this CHISQ distribution is the critical value CV = 11.07049769
• We can compute this percentile using the Excel formula =CHISQ.INV.RT(5%, 5)
• We can also compute it using the Excel formula =CHISQ.INV(95%, 5)

• P-value
• We can compute Pr(chisq(df = 5) ≤ 5.08) using the Excel formula =CHISQ.DIST(5.08, 5, 1). The result is 59.38%.
• We can compute Pr(chisq(df = 5) > 5.08) using the Excel formula =CHISQ.DIST.RT(5.08, 5). The result is 40.62%.
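
These Excel results can be cross-checked without a spreadsheet. The sketch below is one way to do it using only Python's standard library: chi2_cdf evaluates the chi-square CDF through the series expansion of the regularized lower incomplete gamma function, and chi2_critical_value inverts it by bisection. Both function names are hypothetical helpers written for this illustration; where SciPy is available, scipy.stats.chi2.ppf and scipy.stats.chi2.sf do the same job.

```python
import math


def chi2_cdf(x, df):
    """Pr(chisq(df) <= x), like Excel's CHISQ.DIST(x, df, 1)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_critical_value(alpha, df):
    """Value CV with Pr(chisq(df) > CV) = alpha, like CHISQ.INV.RT(alpha, df)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:  # grow the bracket until it contains CV
        hi *= 2.0
    for _ in range(200):  # bisection on the monotone CDF
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


cv = chi2_critical_value(0.05, 5)
p_value = 1.0 - chi2_cdf(5.08, 5)
print(f"CV = {cv:.4f}")            # the lecture reports 11.0705
print(f"p-value = {p_value:.4f}")  # the lecture reports 40.62%
```
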


Example
• Test statistic is below the critical value: 5.08 < 11.07. So the data do not provide evidence that the null hypothesis is false.

• OR

• We can compute the probability that the test statistic is at this level or higher
• P-Value = Pr(CHISQ > TS) = Pr(CHISQ > 5.08) = 40.62%
• P-Value is much higher than the “level of significance” α = 5%
• Hence we don’t find evidence against the null hypothesis, as the P-value is not lower than 5%


Example


Contingency tables for testing the hypothesis of independence

• Generalise the previous Chi-square technique to the case where two variables are involved


Contingency tables for testing the hypothesis of independence


Contingency tables for testing the hypothesis of independence

• If the probability that B takes value 1 or 2 is the same regardless of whether A takes the value 1 or 2, then the variable A and the variable B are independent


Contingency tables for testing the hypothesis of independence

• The expected frequencies for independence are
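
Under independence, the expected count in a cell is usually obtained from the table margins. A sketch of that standard formula, where n_{i·} (total of row i), n_{·j} (total of column j) and n (grand total) are symbols assumed here for illustration:

```latex
E_{i,j} = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n}
        = \frac{n_{i\cdot} \, n_{\cdot j}}{n}
```
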


Contingency tables for testing the hypothesis of independence

• The test statistic for testing independence is

• F_{i,j} are observed frequencies and E_{i,j} are expected frequencies.

• If the observed frequencies match the expected ones, then TS = 0.
• If TS is “large”, this is evidence against the hypothesis of independence
• The TS has the CHISQ distribution with DF = (R − 1)(C − 1), where R and C are the numbers of rows and columns in the table
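
As a rough illustration, the sketch below computes the statistic in its usual Pearson form, the sum over all cells of (F − E)² / E, with each expected frequency taken as row total × column total / grand total. The 2×2 table is made-up data and independence_test is a hypothetical helper name; neither comes from the lecture.

```python
def independence_test(table):
    """Return (TS, DF, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    ts = sum(
        (f - e) ** 2 / e
        for f_row, e_row in zip(table, expected)
        for f, e in zip(f_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return ts, df, expected


# Made-up 2x2 table: rows = levels of variable A, columns = levels of variable B.
table = [[30, 20],
         [20, 30]]
ts, df, expected = independence_test(table)
print(ts, df, expected)
```

The resulting TS would then be compared with the CHISQ critical value for DF degrees of freedom, or converted to a p-value with CHISQ.DIST.RT(TS, DF).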


Example 5.14

• CHISQ.DIST.RT(TS, DF)
• p-value = 5.78%

• If α = 5%, there is insufficient evidence to conclude that sales performance depends on grad status


Yates’ correction factor

• In special cases where some expected frequencies are low, the chi-square formula often overestimates the statistical significance of the data. This can lead to H0 being mistakenly rejected.
• Yates’ correction factor is mainly used when at least one cell of the contingency table has an expected frequency of less than 5.

• In situations with large sample sizes, using the correction will have little effect.
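
In its usual textbook form, commonly applied to 2×2 tables, Yates' correction subtracts 0.5 from each absolute difference |F − E| before squaring. The sketch below assumes that form; the function name yates_chi_square and the table are hypothetical and serve only to show the mechanics.

```python
def yates_chi_square(table):
    """Chi-square statistic with Yates' continuity correction for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    ts = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency under independence
            ts += (abs(f - e) - 0.5) ** 2 / e      # shrink each |F - E| by 0.5
    return ts


# Made-up 2x2 table, for illustration only.
table = [[30, 20],
         [20, 30]]
corrected = yates_chi_square(table)
print(corrected)  # smaller than the uncorrected statistic, which is 4.0 for this table
```

Because the corrected statistic is smaller, the corresponding p-value is larger, which makes a mistaken rejection of H0 less likely.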


Sample size selection

• When measuring / estimating sample means, sample proportions or other important metrics / population characteristics, ACCURACY is an issue. Accuracy in turn depends on the natural variability in the data and on the sample size.

• Select a sample of a size that meets the accuracy requirements

• Sampling generally costs money and time. The main reason for sampling, rather than measuring the entire population, is the cost and time involved.


Sample size for means

• The length of the confidence interval:

• Question: what value to use for s, the estimated standard deviation of the data?
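
For reference, a confidence interval for a mean is commonly written in the large-sample form below, which is assumed here (the slide's own formula may use a t quantile instead of z). The sample mean \bar{x}, the sample size n and the standard normal quantile z_{\alpha/2} are notation assumed for this sketch.

```latex
\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}},
\qquad
\text{length} = 2 \, z_{\alpha/2} \frac{s}{\sqrt{n}}
```
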
Sample size for means

• What value to use for s?


• Typical approach 1 – use the population standard deviation σ if you know it (but usually you don’t)
• Typical approach 2 – do a small pilot study with n = 25 and estimate s from that. For higher accuracy, expand the pilot study to give a more accurate result

• Approach 2 is more likely to be encountered in practice. To apply it we need to
o have made an estimate of s and
o have decided how confident we want to be in our level of accuracy


Sample size for means

• Define D as the largest distance from the true value of the mean that we are willing to accept

• So

• E.g. to be 95% confident, select n such that
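
One common way to turn this requirement into a number is n ≥ (z · s / D)², where z is the standard normal quantile for the chosen confidence (about 1.96 for 95%). The sketch below assumes that large-sample formula; the function name and the inputs s = 10 and D = 2 are illustrative and not taken from the lecture or the textbook example.

```python
import math


def sample_size_for_mean(s, d, z=1.96):
    """Smallest n with z * s / sqrt(n) <= d, i.e. n >= (z * s / d)^2."""
    if s <= 0 or d <= 0:
        raise ValueError("s and d must be positive")
    return math.ceil((z * s / d) ** 2)


# Illustrative inputs: pilot estimate s = 10, required accuracy D = 2, 95% confidence.
n_required = sample_size_for_mean(10, 2)
print(n_required)  # (1.96 * 10 / 2)^2 = 96.04, rounded up to 97
```

Halving D roughly quadruples the required sample size, since n grows with 1 / D².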


Example 5.19 from the Textbook:


Sample size for estimating population proportion π

• For accuracy (distance) D with confidence 1 − α = 95% (α = 5%):

• Sample size should then be
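
A frequently used formula for this case is n ≥ z² · π(1 − π) / D², with z ≈ 1.96 for 95% confidence. Since π is unknown before sampling, a planning value is needed; π = 0.5 is the most conservative choice because it maximises π(1 − π). The sketch below assumes this formula, and the function name and the choice D = 0.03 are illustrative only.

```python
import math


def sample_size_for_proportion(d, pi=0.5, z=1.96):
    """Smallest n with z * sqrt(pi * (1 - pi) / n) <= d."""
    if not 0 < pi < 1 or d <= 0:
        raise ValueError("need 0 < pi < 1 and d > 0")
    return math.ceil(z ** 2 * pi * (1 - pi) / d ** 2)


n_conservative = sample_size_for_proportion(0.03)        # planning value pi = 0.5
n_informed = sample_size_for_proportion(0.03, pi=0.2)    # if pi is believed to be near 0.2
print(n_conservative, n_informed)
```
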


Example 5.21 from the textbook
