Introductory HIV/AIDS Data Analysis
Workshop
Pham Ngoc Thach University of Medicine
Rishi Chakraborty1,2
1Center for AIDS Research, Duke University, North Carolina, USA
2Department of Biostatistics and Bioinformatics, Duke University, North Carolina, USA
March 19-20, 2024
Biostatistics Review
MODELS
Relate a dependent variable, outcome, or response, Y, to some other variable(s).
These other variable(s) are called independent or regressor variables, the X’s.
LINEAR MODELS
Y is Normally distributed:  Y ~ N(µ, σ²)
If all the independent variables are numeric,
we have regression.
If all the independent variables are
categorical, we have analysis of variance.
If we have both types of independent
variables, we have analysis of covariance.
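As an illustration (not part of the original slides), a minimal Python sketch of all three cases using statsmodels formulas; the data frame and the column names y, age, and sex are made up:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: a numeric outcome y, a numeric covariate age,
# and a categorical covariate sex
df = pd.DataFrame({
    "y":   [120, 115, 130, 128, 110, 140, 125, 118],
    "age": [30, 25, 45, 50, 22, 60, 41, 33],
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
})

# regression: all independent variables numeric
reg = smf.ols("y ~ age", data=df).fit()

# analysis of variance: all independent variables categorical
anova = smf.ols("y ~ C(sex)", data=df).fit()

# analysis of covariance: both types of independent variables
ancova = smf.ols("y ~ age + C(sex)", data=df).fit()

print(ancova.summary())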
Exceptions
Correlation - not a model at all
Logistic regression - dichotomous outcome
Poisson regression - count outcome
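A similar sketch for the last two exceptions, again with made-up data and statsmodels:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: a dichotomous outcome and a count outcome
df = pd.DataFrame({
    "infected": [0, 1, 0, 1, 1, 0, 1, 0],   # dichotomous outcome
    "visits":   [2, 5, 1, 7, 4, 0, 6, 3],   # count outcome
    "age":      [30, 25, 45, 50, 22, 60, 41, 33],
})

# logistic regression for a dichotomous outcome
logit_fit = smf.logit("infected ~ age", data=df).fit(disp=0)

# Poisson regression for a count outcome
pois_fit = smf.poisson("visits ~ age", data=df).fit(disp=0)

print(logit_fit.params)
print(pois_fit.params)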
The sciences do not try to explain, they hardly
even try to interpret, they mainly make
models. By a model is meant a mathematical
construct which, with the addition of certain
verbal interpretations, describes observed
phenomena. The justification of such a
mathematical construct is solely and precisely
that it is expected to work.
John von Neumann
[Diagram: Population (parameters, N obs.) and Sample (statistics, n obs.). Probability reasons from the population to the sample; inferential statistics reasons from the sample back to the population.]
Statistics
Sample mean:  X̄ = (∑ Xᵢ)/n,  summing over i = 1, …, n

Sample median

Sample variance:  s² = ∑ (Xᵢ − X̄)² / (n − 1)

Sample standard deviation:  s = √s²
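A quick numeric check of these formulas in Python; the data vector x is made up:

import numpy as np

x = np.array([4.0, 7.0, 6.0, 3.0, 9.0, 5.0])   # hypothetical sample
n = len(x)

xbar = x.sum() / n                          # sample mean
med  = np.median(x)                         # sample median
s2   = ((x - xbar) ** 2).sum() / (n - 1)    # sample variance
s    = np.sqrt(s2)                          # sample standard deviation

# the built-in versions agree (ddof=1 gives the n - 1 divisor)
assert np.isclose(s2, np.var(x, ddof=1))
assert np.isclose(s,  np.std(x, ddof=1))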
Probability
0 ≤ p ≤ 1
Discrete Distributions
Binomial - number of successes in n trials
Poisson - number of events in an interval
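A sketch of how these probabilities can be computed with scipy.stats; the particular n, p, and mean values are just examples:

from scipy.stats import binom, poisson

# Binomial: number of successes in n trials
# e.g. P(exactly 3 successes in 10 trials with success probability 0.2)
print(binom.pmf(3, n=10, p=0.2))

# Poisson: number of events in an interval
# e.g. P(exactly 2 events when the mean count is 4)
print(poisson.pmf(2, mu=4))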
Continuous Distributions
Normal:  Y ~ N(µ, σ²)
[Figure: Normal density curve, with the horizontal axis marked at µ−σ, µ, and µ+σ]
A standard Normal distribution is one where
µ = 0 and σ² = 1. This is denoted by Z.
Z ~ N(0 , 1)
[Figure: standard Normal density curve, horizontal axis marked from −3 to 3]
Table A of the statistical tables gives cumulative
probabilities for a standard Normal distribution.
P(Z < 1.27)
[Figure: standard Normal curve with the area to the left of 1.27 shaded]
Table A (continued)
Cumulative Probabilities for the Standard Normal (Z) Distribution
Z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09 Z
0.00 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359 0.00
0.10 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753 0.10
0.20 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141 0.20
0.30 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517 0.30
0.40 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879 0.40
0.50 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224 0.50
0.60 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549 0.60
0.70 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852 0.70
0.80 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133 0.80
0.90 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389 0.90
1.00 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621 1.00
1.10 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830 1.10
1.20 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015 1.20
1.30 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177 1.30
1.40 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319 1.40
1.50 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441 1.50
From Table A, P(Z < 1.27) = .8980
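The same lookup can be done in software; a one-line sketch with scipy (not part of the original slides):

from scipy.stats import norm

# P(Z < 1.27) for a standard Normal
print(norm.cdf(1.27))   # approximately 0.8980, matching Table A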
For other Normal distributions, we can convert
to a standard Normal by standardizing.
Z = (Y − µ)/σ ~ N(0, 1)

Example:  Y = diastolic blood pressure,  Y ~ N(77, 11.6²)

P(Y < 60) = P(Z < (60 − 77)/11.6) = P(Z < −1.47) = .0708
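A sketch of the same blood-pressure calculation in Python; in scipy, loc and scale are the mean µ and standard deviation σ:

from scipy.stats import norm

mu, sigma = 77, 11.6                       # Y ~ N(77, 11.6^2)

z = (60 - mu) / sigma                      # standardize: z ≈ -1.4655
print(norm.cdf(round(z, 2)))               # 0.0708, the table lookup at z = -1.47
print(norm.cdf(60, loc=mu, scale=sigma))   # ≈ 0.0714 without rounding z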
Other Distributions
t
- one parameter, called the df
- similar to a Z, but with “fatter tails”
- specific percentiles are in Table B
t(12),.95
Table B
Percentiles of the t-Distribution
df t.60 t.70 t.80 t.90 t.95 t.975 t.99 t.995 t.9995
1 0.325 0.727 1.376 3.078 6.314 12.706 31.821 63.657 636.619
2 0.289 0.617 1.061 1.886 2.920 4.303 6.965 9.925 31.599
3 0.277 0.584 0.978 1.638 2.353 3.182 4.541 5.841 12.924
4 0.271 0.569 0.941 1.533 2.132 2.776 3.747 4.604 8.610
5 0.267 0.559 0.920 1.476 2.015 2.571 3.365 4.032 6.869
6 0.265 0.553 0.906 1.440 1.943 2.447 3.143 3.707 5.959
7 0.263 0.549 0.896 1.415 1.895 2.365 2.998 3.499 5.408
8 0.262 0.546 0.889 1.397 1.860 2.306 2.896 3.355 5.041
9 0.261 0.543 0.883 1.383 1.833 2.262 2.821 3.250 4.781
10 0.260 0.542 0.879 1.372 1.812 2.228 2.764 3.169 4.587
11 0.260 0.540 0.876 1.363 1.796 2.201 2.718 3.106 4.437
12 0.259 0.539 0.873 1.356 1.782 2.179 2.681 3.055 4.318
13 0.259 0.538 0.870 1.350 1.771 2.160 2.650 3.012 4.221
14 0.258 0.537 0.868 1.345 1.761 2.145 2.624 2.977 4.140
15 0.258 0.536 0.866 1.341 1.753 2.131 2.602 2.947 4.073
From Table B, t(12),.95 = 1.782

For “lower tail” values, t(df),α = −t(df),1−α
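A sketch of the same lookups with scipy instead of Table B:

from scipy.stats import t

# 95th percentile of a t-distribution with 12 df
print(t.ppf(0.95, df=12))    # ≈ 1.782

# "lower tail" value: t(12),.05 = -t(12),.95
print(t.ppf(0.05, df=12))    # ≈ -1.782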
χ2
- one parameter, called the df
- specific percentiles are in Table C
F
- two parameters, called the numerator df
and the denominator df
- specific percentiles are in Tables D1 – D3
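A sketch of the corresponding lookups with scipy; the particular df values (5 for χ², and 3 and 20 for F) are just examples:

from scipy.stats import chi2, f

# 95th percentile of a chi-square distribution with 5 df (Table C)
print(chi2.ppf(0.95, df=5))          # ≈ 11.07

# 95th percentile of an F-distribution with 3 and 20 df (Tables D1 - D3)
print(f.ppf(0.95, dfn=3, dfd=20))    # ≈ 3.10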
Sampling Distributions
The mean of a sampling distribution is called
the expected value of the statistic.
The standard deviation of a sampling distribution
is called the standard error of the statistic.
Sampling Distribution of X̄

E(X̄) = µ

Var(X̄) = σ²/n  ⇒  s.e.(X̄) = σ/√n

If X ~ N(µ, σ²), then X̄ ~ N(µ, σ²/n)

⇒ (X̄ − µ)/(σ/√n) ~ N(0, 1)
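A small simulation sketch of these facts; the population values µ = 77, σ = 11.6 and the sample size n = 25 are made up:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 77, 11.6, 25

# draw many samples of size n and keep each sample mean
xbars = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(xbars.mean())           # ≈ mu            (E(X-bar) = mu)
print(xbars.std(ddof=1))      # ≈ sigma/sqrt(n) (the standard error)
print(sigma / np.sqrt(n))     # = 2.32, for comparison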
Central Limit Theorem
For n sufficiently large, the sampling
distribution of X̄ is at least approximately
Normal for any underlying distribution!

(X̄ − µ)/(σ/√n) ~ N(0, 1)
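A simulation sketch of the Central Limit Theorem using a clearly non-Normal (exponential) population; all values are made up for illustration:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50                                   # "sufficiently large" sample size

# a non-Normal population: exponential with mean 2 (and sd 2)
samples = rng.exponential(scale=2.0, size=(10_000, n))
z = (samples.mean(axis=1) - 2.0) / (2.0 / np.sqrt(n))   # (X-bar - mu)/(sigma/sqrt(n))

# the standardized means behave approximately like Z ~ N(0, 1)
print(np.mean(z < 1.27), norm.cdf(1.27))   # both ≈ 0.9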
Statistical Inference
- Estimation
- Hypothesis Testing
A point estimate is a single statistic that is
used to estimate a population parameter.
We can also estimate a parameter by a
100(1-α)% confidence interval.
This has a probability of “capture” of (1-α).
[Figure: confidence intervals from many repeated samples drawn around a vertical line at µ; 100(1−α)% of these intervals will capture the parameter (µ)]
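A simulation sketch of this capture property for the t-based interval described below; the population values are made up:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
mu, sigma, n, alpha = 50.0, 10.0, 20, 0.05

covered = 0
for _ in range(10_000):
    x = rng.normal(mu, sigma, size=n)
    half = t.ppf(1 - alpha / 2, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half <= mu <= x.mean() + half)

print(covered / 10_000)    # ≈ 0.95 = 1 - alpha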
Form of most confidence intervals:
point estimate ± (table value)(std. error)
A 100(1-α)% C. I. for µ is:
X̄ ± t(n−1),1−α/2 · s/√n
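A sketch of this interval computed directly in Python; the sample x is made up:

import numpy as np
from scipy.stats import t

x = np.array([74, 81, 69, 90, 77, 85, 72, 79])   # hypothetical sample
n, alpha = len(x), 0.05

xbar, s = x.mean(), x.std(ddof=1)
half = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)

print(xbar - half, xbar + half)    # 95% confidence interval for mu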
Hypothesis testing
Test a null hypothesis, H0,
against an alternative hypothesis, H1.
Two possible decisions:
- Reject H0 (in favor of H1)
- Fail to reject H0
TRUTH
DECISION H0 true H1 true
Reject H0 Type I error correct
Fail to reject H0 correct Type II error
α = P(Type I error) = P(Reject H0 | H0 true)
α is the significance level of the test
β = P(Type II error) = P(Fail to reject H0 | H1 true)
Power = P(Reject H0 | H1 true) = 1 − β
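A simulation sketch of these error rates and of power for a one-sample t-test; the sample size, σ, and the true means are invented:

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
n, alpha, mu0 = 30, 0.05, 100.0

def reject_rate(true_mu, sims=5_000):
    # fraction of simulated samples in which H0: mu = mu0 is rejected
    rejections = 0
    for _ in range(sims):
        x = rng.normal(true_mu, 15.0, size=n)
        rejections += ttest_1samp(x, mu0).pvalue < alpha
    return rejections / sims

print(reject_rate(100.0))   # H0 true: ≈ alpha = 0.05 (Type I error rate)
print(reject_rate(110.0))   # H1 true: power = 1 - beta (well above 0.9 here)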
p - values
The probability of getting a test statistic at
least as “extreme” (in the direction stated
by H1) as the one observed.
Reject H0 if the p-value < α.
Hypothesis Testing Steps
1) Determine hypotheses
2) Decide on α ( .01 , .05 , .10 )
3 & 4) State rejection region, calculate test statistic
(or)
Calculate test statistic and p-value
5) Make decision (reject or not reject)
6) Write conclusions (interpret results) in the context of the problem
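A worked sketch of these steps for a one-sample t-test in Python; the data and the hypothesized mean of 80 are invented:

import numpy as np
from scipy.stats import ttest_1samp

# 1) Hypotheses: H0: mu = 80 versus H1: mu != 80
# 2) Significance level
alpha = 0.05

# 3 & 4) Test statistic and p-value
x = np.array([74, 81, 69, 90, 77, 85, 72, 79])   # hypothetical sample
result = ttest_1samp(x, popmean=80)
print(result.statistic, result.pvalue)

# 5) Decision: reject H0 if the p-value < alpha
print("reject H0" if result.pvalue < alpha else "fail to reject H0")

# 6) The conclusion is then written in the context of the problem.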