Chapter 2
James Chapman
Curriculum Manager, DataCamp
Two-sample problems
Compare sample statistics across groups of a variable
converted_comp is a numerical variable
Are users who first programmed as a child compensated higher than those that started as
adults?
H0 : μchild = μadult
H0 : μchild − μadult = 0
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult.
age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64
x̄ - a sample mean
x̄child - sample mean compensation for coding first as a child
x̄adult - sample mean compensation for coding first as an adult
x̄child − x̄adult - a test statistic
z-score - a (standardized) test statistic
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64
s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()
age_first_code_cut
adult 271546.521729
child 255585.240115
Name: converted_comp, dtype: float64
n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()
age_first_code_cut
adult 1376
child 885
Name: converted_comp, dtype: int64
import numpy as np
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator
1.8699313316221844
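As a quick check, the calculation can be reproduced from the summary statistics printed earlier. This is a minimal sketch: the six numbers are copied by hand from the groupby output rather than read from the `stack_overflow` DataFrame, and `math.sqrt` stands in for `np.sqrt` so that the snippet needs only the standard library.

```python
import math

# Summary statistics copied from the groupby output (mean, std, count)
xbar_child, xbar_adult = 132419.570621, 111313.311047
s_child, s_adult = 255585.240115, 271546.521729
n_child, n_adult = 885, 1376

# Difference in sample means, divided by its approximate standard error
numerator = xbar_child - xbar_adult
denominator = math.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator

print(t_stat)
```

Up to rounding of the copied inputs, the printed value should agree with the t statistic shown above.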
t-distributions
t statistic follows a t-distribution
Have a parameter named degrees of freedom, or df
Look like normal distributions, with fatter tails
df = nchild + nadult − 2
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult
If p ≤ α then reject H0 .
SE(x̄child − x̄adult) ≈ √(s²child / nchild + s²adult / nadult)
z-statistic: needed when using one sample statistic to estimate a population parameter
t-statistic: needed when using multiple sample statistics to estimate a population parameter
1.8699313316221844
2259
0.030811302165157595
Evidence that Stack Overflow data scientists who started coding as a child earn more.
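The three numbers above (t statistic, degrees of freedom, p-value) can be tied together with the t-distribution CDF. The sketch below assumes SciPy is available and uses `scipy.stats.t.cdf`. Because the alternative hypothesis is "greater", the p-value is taken from the right tail. The variable names are illustrative.

```python
from scipy.stats import t

# Test statistic and group sizes taken from the slides
t_stat = 1.8699313316221844
n_child, n_adult = 885, 1376

# df = n_child + n_adult - 2
degrees_of_freedom = n_child + n_adult - 2

# Right-tailed test: probability of a t value at least this large under H0
p_value = 1 - t.cdf(t_stat, df=degrees_of_freedom)

print(degrees_of_freedom, p_value)
```

The p-value is then compared with the chosen significance level α: if p ≤ α, reject H0.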
US Republican presidents dataset
state county repub_percent_08 repub_percent_12
0 Alabama Hale 38.957877 37.139882
1 Arkansas Nevada 56.726272 58.983452
2 California Lake 38.896719 39.331367
3 California Ventura 42.923190 45.250693
.. ... ... ... ...
96 Wisconsin La Crosse 37.490904 40.577038
97 Wisconsin Lafayette 38.104967 41.675050
98 Wyoming Weston 76.684241 83.983328
99 Alaska District 34 77.063259 40.789626
1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
H0 : μ2008 − μ2012 = 0
-2.877109041242944
df = ndiff − 1
New hypotheses:
H0 : μdiff = 0
HA : μdiff < 0
degrees_of_freedom = n_diff - 1
9.572537285272411e-08
99
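The paired calculation can be sketched end to end on a toy example. The two lists below are made-up values for illustration only, not rows from the election dataset, and names such as `before` and `after` are hypothetical. The steps follow the hypotheses above: take the per-pair difference, standardize its mean against the hypothesized value of 0, and read the left-tail probability from a t-distribution with n_diff − 1 degrees of freedom (SciPy assumed).

```python
import math
from scipy.stats import t

# Hypothetical paired measurements (made up for illustration)
before = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
after = [32.0, 43.0, 51.0, 63.0, 72.0, 84.0]

# One difference per pair
diff = [b - a for b, a in zip(before, after)]
n_diff = len(diff)

# Sample mean and sample standard deviation (n - 1 denominator) of the differences
xbar_diff = sum(diff) / n_diff
s_diff = math.sqrt(sum((d - xbar_diff) ** 2 for d in diff) / (n_diff - 1))

# t statistic for H0: mu_diff = 0
t_stat = (xbar_diff - 0) / math.sqrt(s_diff ** 2 / n_diff)
degrees_of_freedom = n_diff - 1

# Left-tailed test (HA: mu_diff < 0), so use the CDF directly
p_value = t.cdf(t_stat, df=degrees_of_freedom)

print(t_stat, degrees_of_freedom, p_value)
```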
BF10 power
T-test 1.323e+05 1.0
1 Details on Returns from pingouin.ttest() are available in the API docs for pingouin at https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest.
BF10 power
T-test 1.323e+05 0.696338
power
T-test 0.454972
Unpaired t-tests on paired data increase the chances of false negative errors
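A small illustration of why this happens, swapping in SciPy's `ttest_rel` and `ttest_ind` for `pingouin.ttest()` and using made-up numbers (the lists `percent_a` and `percent_b` are hypothetical). Every pair shifts by a few points in the same direction, but the values vary widely between pairs. The paired test sees only the consistent shift. The unpaired test has to compare that shift against the large between-pair spread. Both tests here use SciPy's default two-sided alternative.

```python
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical paired percentages (made up for illustration)
percent_a = [38.0, 57.0, 39.0, 43.0, 37.0, 77.0]
percent_b = [40.0, 59.0, 42.0, 45.0, 41.0, 80.0]

# Paired test: works on the per-pair differences
paired = ttest_rel(percent_a, percent_b)

# Unpaired test: treats the two lists as independent samples
unpaired = ttest_ind(percent_a, percent_b)

print("paired p-value:  ", paired.pvalue)
print("unpaired p-value:", unpaired.pvalue)
```

On this toy data the unpaired p-value is much larger, so a real difference would be missed at common significance levels.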
Job satisfaction: 5 categories
stack_overflow['job_sat'].value_counts()
alpha = 0.2
pingouin.anova(data=stack_overflow,
dv="converted_comp",
between="job_sat")
0.001315 < α
At least two categories have significantly different compensation
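The same decision rule can be sketched without the survey data. The example below swaps in `scipy.stats.f_oneway` for `pingouin.anova()` and uses three made-up groups (the names and values are hypothetical), so it shows only the mechanics of a one-way ANOVA followed by the comparison with α, not the Stack Overflow result.

```python
from scipy.stats import f_oneway

alpha = 0.2

# Hypothetical compensation values for three categories (made up for illustration)
group_a = [50.0, 52.0, 48.0, 51.0, 49.0]
group_b = [60.0, 62.0, 58.0, 61.0, 59.0]
group_c = [50.0, 51.0, 49.0, 52.0, 48.0]

# One-way ANOVA: H0 is that all group means are equal
f_stat, p_value = f_oneway(group_a, group_b, group_c)

# Reject H0 when the p-value is at or below the significance level
reject_h0 = p_value <= alpha

print(f_stat, p_value, reject_h0)
```

A significant ANOVA says only that at least two groups differ. It does not say which ones.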