Chapter 3
One-sample proportion tests
HYPOTHESIS TESTING IN R
Richie Cotton
Data Evangelist at DataCamp
Chapter 1 recap
Is a claim about an unknown population proportion feasible?
Standard error of sample statistic calculated using bootstrap distribution.
Here, we'll calculate the test statistic without using the bootstrap distribution.
Standardized test statistic for proportions
p: population proportion (unknown population parameter)
Easier standard error calculations
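$$SE(\bar{x}_{child} - \bar{x}_{adult}) \approx \sqrt{\frac{s_{child}^2}{n_{child}} + \frac{s_{adult}^2}{n_{adult}}}$$

$$SE_{\hat{p}} = \sqrt{\frac{p_0 (1 - p_0)}{n}}$$

Assuming H0 is true,

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$$

This only uses sample information ($\hat{p}$ and $n$) and the hypothesized parameter ($p_0$).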
Why z instead of t?
$$t = \frac{\bar{x}_{child} - \bar{x}_{adult}}{\sqrt{\frac{s_{child}^2}{n_{child}} + \frac{s_{adult}^2}{n_{adult}}}}$$
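s is calculated from x̄, so x̄ is used both to estimate the population mean and to estimate the population standard deviation.
This double use adds uncertainty, which the heavier tails of the t-distribution account for.
In the proportion test, p̂ appears only in the numerator, so a z-score is appropriate.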
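One way to see what the extra uncertainty costs is to compare critical values. The sketch below uses base R quantile functions; the 5% two-sided level and the degrees of freedom (5 and 1000) are illustrative assumptions, not values from the course.

```r
# Critical values for a two-sided test at the 5% level (illustrative choice)
z_crit <- qnorm(0.975)                 # standard normal
t_crit_small <- qt(0.975, df = 5)      # t with few degrees of freedom
t_crit_large <- qt(0.975, df = 1000)   # t with many degrees of freedom

# The t critical values are larger than z and shrink toward it as df grows
c(z = z_crit, t_df5 = t_crit_small, t_df1000 = t_crit_large)
```

The t-distribution demands a more extreme statistic for the same significance level, and the gap closes as the sample grows.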
Stack Overflow age categories
H0 : The proportion of SO users under thirty is equal to 0.5.
stack_overflow %>%
  count(age_cat)
# A tibble: 2 x 2
  age_cat         n
  <chr>       <int>
1 At least 30  1050
2 Under 30     1216
Variables for z
p_hat <- stack_overflow %>%
  summarize(prop_under_30 = mean(age_cat == "Under 30")) %>%
  pull(prop_under_30)
0.5366
n <- nrow(stack_overflow)
2266
Calculating the z-score
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$$
3.487
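The calculation can be reproduced in base R from the age category counts alone. This is a minimal sketch: the counts come from the count(age_cat) output, while the variable names (n_under_30, p_0, std_error) are chosen here for illustration.

```r
# Counts taken from the count(age_cat) output
n_under_30 <- 1216
n_at_least_30 <- 1050

n <- n_under_30 + n_at_least_30   # 2266 respondents
p_hat <- n_under_30 / n           # sample proportion, about 0.5366
p_0 <- 0.5                        # hypothesized proportion under H0

# The standard error under H0 uses p_0, not p_hat
std_error <- sqrt(p_0 * (1 - p_0) / n)
z_score <- (p_hat - p_0) / std_error
round(z_score, 3)
```

Rounded to three decimals, this matches the 3.487 shown on the slide.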
Calculating the p-value
Two-tailed ("not equal")
p_value <- 2 * pnorm(abs(z_score), lower.tail = FALSE)
p_value <= alpha
TRUE
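The choice of tail changes which area under the standard normal curve counts as the p-value. The sketch below shows all three cases with base R's pnorm(); the names p_left, p_right and p_two are hypothetical, and z_score is set to the value reported earlier.

```r
z_score <- 3.487  # z-score reported for the one-sample proportion test

# Left-tailed ("less than"): area below z
p_left <- pnorm(z_score)
# Right-tailed ("greater than"): area above z
p_right <- pnorm(z_score, lower.tail = FALSE)
# Two-tailed ("not equal"): area in both tails beyond |z|
p_two <- 2 * pnorm(abs(z_score), lower.tail = FALSE)

c(left = p_left, right = p_right, two_sided = p_two)
```

Because the normal distribution is symmetric, the two-tailed value is twice the smaller one-tailed value.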
Two-sample proportion tests
Comparing two proportions
H0: The proportion of SO users who are hobbyists is the same for those under thirty as for those at least thirty.
$$H_0: p_{\geq 30} - p_{<30} = 0$$
HA: The proportion of SO users who are hobbyists differs between those under thirty and those at least thirty.
$$H_A: p_{\geq 30} - p_{<30} \neq 0$$
Calculating the z-score
$$z = \frac{(\hat{p}_{\geq 30} - \hat{p}_{<30}) - 0}{SE(\hat{p}_{\geq 30} - \hat{p}_{<30})}$$
Getting the numbers for the z-score
stack_overflow %>%
  group_by(age_cat) %>%
  summarize(
    p_hat = mean(hobbyist == "Yes"),
    n = n()
  )
# A tibble: 2 x 3
  age_cat     p_hat     n
  <chr>       <dbl> <int>
1 At least 30 0.773  1050
2 Under 30    0.843  1216
z_score
-4.217
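A z-score close to the reported one can be recovered from the summary table. The sketch below assumes the standard error is computed from a pooled proportion (the sample-size-weighted mean of the two sample proportions), which is the usual choice under H0 but is not spelled out on these slides. It uses the rounded table values, so the result lands near, not exactly on, -4.217.

```r
# Summary values as printed in the table (rounded to three decimals)
p_hat_at_least_30 <- 0.773
p_hat_under_30 <- 0.843
n_at_least_30 <- 1050
n_under_30 <- 1216

# Pooled proportion: weighted mean of the two sample proportions (assumption)
p_hat_pooled <- (n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30) /
  (n_at_least_30 + n_under_30)

# Standard error of the difference, assuming H0 (equal proportions)
std_error <- sqrt(
  p_hat_pooled * (1 - p_hat_pooled) / n_at_least_30 +
    p_hat_pooled * (1 - p_hat_pooled) / n_under_30
)

z_score <- (p_hat_at_least_30 - p_hat_under_30) / std_error
z_score
```

The sign is negative because the "At least 30" proportion is subtracted first and is the smaller of the two.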
Proportion tests using prop_test()
library(infer)
stack_overflow %>%
  prop_test(
    hobbyist ~ age_cat,                     # proportions ~ categories
    order = c("At least 30", "Under 30"),   # which p-hat to subtract
    success = "Yes",                        # which response value to count proportions of
    alternative = "two-sided",              # type of alternative hypothesis
    correct = FALSE                         # should Yates' continuity correction be applied?
  )
# A tibble: 1 x 6
  statistic chisq_df   p_value alternative lower_ci upper_ci
      <dbl>    <dbl>     <dbl> <chr>          <dbl>    <dbl>
1      17.8        1 0.0000248 two.sided     0.0605    0.165
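The statistic column holds a chi-square value, not a z-score. For two groups without continuity correction, it equals the square of the pooled two-sample z-score. The sketch below checks this with the reported z-score of -4.217; the names chisq_stat, p_from_z and p_from_chisq are illustrative.

```r
z_score <- -4.217          # two-sample z-score reported earlier
chisq_stat <- z_score^2    # about 17.8, in line with the statistic column

# Two equivalent routes to the two-sided p-value
p_from_z <- 2 * pnorm(abs(z_score), lower.tail = FALSE)
p_from_chisq <- pchisq(chisq_stat, df = 1, lower.tail = FALSE)

c(chisq_stat = chisq_stat, p_from_z = p_from_z, p_from_chisq = p_from_chisq)
```

Both p-values agree with each other and with the p_value column up to rounding.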
Chi-square test of independence
Revisiting the proportion test
library(infer)
stack_overflow %>%
  prop_test(
    hobbyist ~ age_cat,
    order = c("At least 30", "Under 30"),
    alternative = "two-sided",
    correct = FALSE
  )
# A tibble: 1 x 6
  statistic chisq_df   p_value alternative lower_ci upper_ci
      <dbl>    <dbl>     <dbl> <chr>          <dbl>    <dbl>
1      17.8        1 0.0000248 two.sided     0.0605    0.165
Independence of variables
Previous hypothesis test result: there is evidence that the hobbyist and age_cat variables
have an association.
If the proportion of successes in the response variable is the same across all categories of the
explanatory variable, the two variables are statistically independent.
1 Response and explanatory variables are defined in "Introduction to Regression in R", Chapter 1.
Job satisfaction and age category
stack_overflow %>%
  count(age_cat)
# A tibble: 2 x 2
  age_cat         n
  <chr>       <int>
1 At least 30  1050
2 Under 30     1211

stack_overflow %>%
  count(job_sat)
# A tibble: 5 x 2
  job_sat                   n
  <fct>                 <int>
1 Very dissatisfied       159
2 Slightly dissatisfied   342
3 Neither                 201
4 Slightly satisfied      680
5 Very satisfied          879
Declaring the hypotheses
H0 : Age categories are independent of job satisfaction levels.
Assuming independence, how far away are the observed results from the expected values?
Exploratory visualization: proportional stacked bar plot
ggplot(stack_overflow, aes(job_sat, fill = age_cat)) +
  geom_bar(position = "fill") +
  ylab("proportion")
Chi-square independence test using chisq_test()
library(infer)
stack_overflow %>%
  chisq_test(age_cat ~ job_sat)
# A tibble: 1 x 3
  statistic chisq_df p_value
      <dbl>    <int>   <dbl>
1      5.55        4   0.235
Degrees of freedom: (number of age categories − 1) × (number of job satisfaction levels − 1)
$$(2 - 1) \times (5 - 1) = 4$$
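The reported p-value follows from the statistic and the degrees of freedom alone. A minimal sketch with base R's pchisq(), using the rounded statistic from the output; the variable names are illustrative.

```r
n_age_categories <- 2
n_job_sat_levels <- 5
chisq_df <- (n_age_categories - 1) * (n_job_sat_levels - 1)

chisq_stat <- 5.55  # statistic reported by chisq_test()

# Right-tail area of the chi-square distribution with 4 degrees of freedom
p_value <- pchisq(chisq_stat, df = chisq_df, lower.tail = FALSE)
round(p_value, 3)
```

This reproduces the p-value of about 0.235 in the table.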
Swapping the variables?
ggplot(stack_overflow, aes(age_cat, fill = job_sat)) +
  geom_bar(position = "fill") +
  ylab("proportion")
Chi-square both ways
library(infer)
stack_overflow %>%
  chisq_test(age_cat ~ job_sat)
# A tibble: 1 x 3
  statistic chisq_df p_value
      <dbl>    <int>   <dbl>
1      5.55        4   0.235

library(infer)
stack_overflow %>%
  chisq_test(job_sat ~ age_cat)
# A tibble: 1 x 3
  statistic chisq_df p_value
      <dbl>    <int>   <dbl>
1      5.55        4   0.235
Swapping the variables gives the same result.
Ask: Are age_cat and job_sat independent?
Not: Is age_cat independent of job_sat?
What about direction and tails?
args(chisq_test)
1 Left-tailed chi-square tests are used in statistical forensics to detect whether a fit is suspiciously good because the data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.
Chi-square goodness of fit tests
Purple links
You search for a coding solution online and the first result link is purple because you
already visited it. How do you feel?
# A tibble: 4 x 2
  purple_link           n
  <fct>             <int>
1 Hello, old friend  1330
2 Amused              409
3 Indifferent         426
4 Annoyed             290
Declaring the hypotheses
hypothesized <- tribble(
  ~ purple_link,       ~ prop,
  "Hello, old friend", 1 / 2,
  "Amused"           , 1 / 6,
  "Indifferent"      , 1 / 6,
  "Annoyed"          , 1 / 6
)
# A tibble: 4 x 2
  purple_link        prop
  <chr>             <dbl>
1 Hello, old friend 0.5
2 Amused            0.167
3 Indifferent       0.167
4 Annoyed           0.167
H0: The sample matches the hypothesized distribution.
HA: The sample does not match the hypothesized distribution.
The test statistic, $\chi^2$, measures how far observed results are from expectations in each group.
alpha <- 0.01
1 tribble is short for "row-wise tibble"; not to be confused with the alien species from Star Trek.
Hypothesized counts by category
n_total <- nrow(stack_overflow)
hypothesized <- tribble(
  ~ purple_link,       ~ prop,
  "Hello, old friend", 1 / 2,
  "Amused"           , 1 / 6,
  "Indifferent"      , 1 / 6,
  "Annoyed"          , 1 / 6
) %>%
  mutate(n = prop * n_total)
# A tibble: 4 x 3
  purple_link        prop     n
  <chr>             <dbl> <dbl>
1 Hello, old friend 0.5   1228.
2 Amused            0.167  409.
3 Indifferent       0.167  409.
4 Annoyed           0.167  409.
Visualizing counts
ggplot(purple_link_counts, aes(purple_link, n)) +
  geom_col() +
  geom_point(data = hypothesized, color = "purple")
Chi-square goodness of fit test using chisq_test()
hypothesized_props <- c(
  "Hello, old friend" = 1 / 2,
  Amused = 1 / 6,
  Indifferent = 1 / 6,
  Annoyed = 1 / 6
)
library(infer)
stack_overflow %>%
  chisq_test(
    response = purple_link,
    p = hypothesized_props
  )
# A tibble: 1 x 3
  statistic chisq_df       p_value
      <dbl>    <dbl>         <dbl>
1      44.0        3 0.00000000154
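The statistic can also be computed by hand from the observed counts and the hypothesized proportions. The sketch below does that and cross-checks against base R's chisq.test(). The counts come from the purple_link table; using chisq.test() instead of infer's chisq_test() is a substitution made here so the example runs without the dataset.

```r
# Observed counts from the purple_link table
observed <- c(
  "Hello, old friend" = 1330,
  Amused = 409,
  Indifferent = 426,
  Annoyed = 290
)
hypothesized_props <- c(1 / 2, 1 / 6, 1 / 6, 1 / 6)

# Expected counts under H0, then the sum of scaled squared differences
expected <- sum(observed) * hypothesized_props
chisq_stat <- sum((observed - expected)^2 / expected)
chisq_df <- length(observed) - 1
p_value <- pchisq(chisq_stat, df = chisq_df, lower.tail = FALSE)

# Cross-check with base R's goodness of fit test
base_result <- chisq.test(observed, p = hypothesized_props)
c(by_hand = chisq_stat, base_r = unname(base_result$statistic))
```

Both routes give a statistic of roughly 44 on 3 degrees of freedom, and the p-value is far below the 0.01 significance level.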