Linear Correlation and Linear Regression + Summary of Tests
$$\mathrm{cov}(x,y)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$
Interpreting Covariance
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are independent
$$r=\frac{\mathrm{covariance}(x,y)}{\sqrt{\mathrm{var}\,x\;\mathrm{var}\,y}}$$
Correlation
Measures the relative strength of the linear relationship between two variables
Unit-less
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to +1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X illustrating r = –1, r = –.6, r = 0, r = +1, r = +.3, and r = 0.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Linear Correlation
[Figure: scatter plot of Y against X showing no relationship.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Calculating by hand…

$$\hat{r}=\frac{\mathrm{covariance}(x,y)}{\sqrt{\mathrm{var}\,x\;\mathrm{var}\,y}}=\frac{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}\;\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}$$
Simpler calculation formula…

The n – 1 terms cancel, leaving:

$$\hat{r}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\;\sum_{i=1}^{n}(y_i-\bar{y})^2}}=\frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$$

where $SS_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ is the numerator of the covariance, and $SS_x=\sum_{i=1}^{n}(x_i-\bar{x})^2$ and $SS_y=\sum_{i=1}^{n}(y_i-\bar{y})^2$ are the numerators of the variances.
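To make the arithmetic concrete, here is a minimal sketch (not from the original slides; the data are invented, loosely echoing the vitamin D example) that checks the SS formula against NumPy's built-in correlation:

```python
# A minimal sketch (illustrative data only): compute r from the SS quantities
# defined above and check it against NumPy's built-in correlation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(63, 33, 100)                 # stand-in "vitamin D" values
y = 20 + 0.15 * x + rng.normal(0, 9, 100)   # stand-in "DSST" values

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))  # numerator of the covariance
ss_x = np.sum((x - x.mean()) ** 2)               # numerator of var(x)
ss_y = np.sum((y - y.mean()) ** 2)               # numerator of var(y)

r_hand = ss_xy / np.sqrt(ss_x * ss_y)
print(r_hand, np.corrcoef(x, y)[0, 1])           # the two values agree
```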
Distribution of the correlation coefficient:

$$SE(\hat{r})=\sqrt{\frac{1-r^{2}}{n-2}}$$

The sample correlation coefficient follows a T-distribution with n – 2 degrees of freedom (since you have to estimate the standard error).
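As an illustration (hypothetical data; scipy assumed available), the t statistic built from this standard error reproduces the p-value that scipy.stats.pearsonr reports:

```python
# Illustrative check (hypothetical data): build the t statistic from
# SE(r) = sqrt((1 - r^2)/(n - 2)) and compare with scipy.stats.pearsonr.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
n = len(x)

r = np.corrcoef(x, y)[0, 1]
t = r / np.sqrt((1 - r**2) / (n - 2))       # T with n-2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-sided p-value

print(t, p)
print(stats.pearsonr(x, y))                  # same r and p-value
```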
What’s Slope?

$$E(y_i \mid x_i)=\alpha+\beta x_i$$

Predicted value for an individual:

$$y_i=\alpha+\beta x_i+\text{random error}_i$$
Regression Picture
[Figure: scatter of observed values y_i against x_i with the fitted least-squares line ŷ_i; the labeled distances A, B, and C and the residual scatter s_{y/x} around the line are marked on the plot.]
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D
Mean = 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets
I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST), simulated in the sketch after this list:
0 points per 10 nmol/L
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate relationship
Dataset 4: moderate relationship
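The lecture's actual simulation code is not shown; the following sketch generates comparable datasets, assuming vit D ~ Normal(63, 33) nmol/L as given above, with the residual SD set to 9 points so the DSST spread stays near the reported SD of 10:

```python
# Sketch of one way to generate such datasets (assumptions noted above); the
# slope term is centered so mean DSST stays at 28 points regardless of slope.
import numpy as np

rng = np.random.default_rng(42)
n = 100
true_slopes = [0, 0.5, 1.0, 1.5]   # points per 10 nmol/L

datasets = {}
for slope in true_slopes:
    vit_d = rng.normal(63, 33, n) / 10                       # in 10-nmol/L units
    dsst = 28 + slope * (vit_d - vit_d.mean()) + rng.normal(0, 9, n)
    datasets[slope] = (vit_d, dsst)
```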
The “Best fit” line
Regression equation: E(Yi) = 28 + 0*vit Di (in 10 nmol/L)

The “Best fit” line
Regression equation: E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L)

The “Best fit” line
Regression equation: E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)

The “Best fit” line
Regression equation: E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L)
What’s the constraint? We are trying to minimize the squared distance (hence the “least squares”) between the observations themselves and the predicted values, $y_i-\hat{y}_i$ (also called the “residuals”, or left-over unexplained variability).

Find the β that gives the minimum sum of the squared differences. How do you find a maximum or minimum of a function? Take the derivative, set it equal to zero, and solve. A typical max/min problem from calculus…

$$\frac{d}{d\beta}\sum_{i=1}^{n}\bigl(y_i-(\alpha+\beta x_i)\bigr)^2=\sum_{i=1}^{n}2\bigl(y_i-\alpha-\beta x_i\bigr)(-x_i)=-2\sum_{i=1}^{n}\bigl(x_i y_i-\alpha x_i-\beta x_i^2\bigr)=0\;\ldots$$
Slope (beta coefficient):

$$\hat{\beta}=\frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$$

and the correlation coefficient can be recovered from the slope:

$$\hat{r}=\hat{\beta}\,\frac{SD_x}{SD_y}$$
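A quick numerical check on invented data (not from the slides): the least-squares slope from np.polyfit agrees with Cov(x,y)/Var(x), and scaling by SDx/SDy recovers r:

```python
# Quick numerical check: np.polyfit's least-squares slope equals
# Cov(x, y)/Var(x), and scaling by SDx/SDy recovers r.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(63, 33, 200)
y = 20 + 0.15 * x + rng.normal(0, 9, 200)

beta_formula = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta_lstsq = np.polyfit(x, y, 1)[0]
r_from_beta = beta_formula * x.std(ddof=1) / y.std(ddof=1)

print(beta_formula, beta_lstsq)             # identical slopes
print(r_from_beta, np.corrcoef(x, y)[0, 1]) # identical correlations
```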
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).
Example: dataset 4
SDx = 33 nmol/L
SDy = 10 points
Cov(X,Y) = 163 points*nmol/L

Beta = 163/33² = 0.15 points per nmol/L = 1.5 points per 10 nmol/L

r = 163/(10*33) = 0.49
Or
r = 0.15 * (33/10) = 0.49
Significance testing…
Slope
Distribution of the slope: $\hat{\beta}\sim T_{n-2}\bigl(\beta,\;s.e.(\hat{\beta})\bigr)$

$$T_{n-2}=\frac{\hat{\beta}-0}{s.e.(\hat{\beta})}$$

Formula for the standard error of beta (you will not have to calculate by hand!):

$$s.e.(\hat{\beta})=\sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{(n-2)\,SS_x}}=\sqrt{\frac{s_{y/x}^2}{SS_x}}$$

where $SS_x=\sum_{i=1}^{n}(x_i-\bar{x})^2$, $s_{y/x}^2=\dfrac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}$, and $\hat{y}_i=\hat{\alpha}+\hat{\beta}x_i$.
Example: dataset 4
Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p<.0001
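For a sketch of the whole test in code (simulated data with the dataset-4 parameters assumed: true slope 1.5 points per 10 nmol/L, n = 100, hence 98 df), scipy.stats.linregress returns the slope, its standard error, and the two-sided p-value in one call:

```python
# Sketch with simulated "dataset 4"-style data (assumptions noted above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100
vit_d = rng.normal(6.3, 3.3, n)               # in 10-nmol/L units
dsst = 20 + 1.5 * vit_d + rng.normal(0, 9, n)

res = stats.linregress(vit_d, dsst)
t = res.slope / res.stderr                    # T_{n-2} = beta_hat / se(beta_hat)
print(res.slope, res.stderr, t, res.pvalue)
```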
$$\hat{y}_i=20+1.5x_i$$

For vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L):

$$\hat{y}_i=20+1.5(9.5)\approx 34$$

Residual = observed – predicted. At X = 95 nmol/L, the observed value is $y_i=48$ and the predicted value is $\hat{y}_i=34$, so the residual is $y_i-\hat{y}_i=14$.
Residual Analysis for Linearity
[Figure: two Y-vs-x scatter plots with their residual plots; panels labeled “Not Linear” and “Linear”.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Residual Analysis for Homoscedasticity
[Figure: two Y-vs-x scatter plots with their residual plots; panels labeled “Non-constant variance” and “Constant variance”.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Residual Analysis for Independence
[Figure: residual plots against X; panels labeled “Not Independent” and “Independent”.]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Residual plot, dataset 4
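A sketch of how such a residual plot can be produced (matplotlib assumed available; simulated stand-in data, since the lecture dataset is not reproduced here). A flat, even band around zero supports linearity and constant variance:

```python
# Illustrative residual-versus-fitted plot on simulated stand-in data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.normal(6.3, 3.3, 100)
y = 20 + 1.5 * x + rng.normal(0, 9, 100)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted DSST")
plt.ylabel("Residual")
plt.show()
```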
Multiple linear regression…
What if age is a confounder here?
Older men have lower vitamin D
Older men have poorer cognition
“Adjust” for age by putting age in the model:
DSST score = intercept + slope1 × vitamin D + slope2 × age
2 predictors: age and vit D…
Different 3D view…
Fit a plane rather than a line…
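A hedged sketch of the age adjustment with statsmodels OLS; all numbers below are invented stand-ins, constructed so that age confounds the vitamin D effect (older men: lower vitamin D, poorer cognition), as described above:

```python
# Fitting a plane: two predictors (vit D and age) in one OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
age = rng.uniform(40, 80, n)
vit_d = 12 - 0.08 * age + rng.normal(0, 2, n)   # in 10-nmol/L units
dsst = 35 + 1.0 * vit_d - 0.2 * age + rng.normal(0, 8, n)

X = sm.add_constant(np.column_stack([vit_d, age]))
fit = sm.OLS(dsst, X).fit()
print(fit.params)      # intercept, slope1 (vit D, age-adjusted), slope2 (age)
```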
$$T_{98}=\frac{40-32.5}{\sqrt{\dfrac{10.8^2}{54}+\dfrac{10.8^2}{46}}}=\frac{7.5}{2.17}=3.46;\quad p=.0008$$
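The same comparison can be run in scipy, on illustrative samples drawn to match the numbers in the formula above (means 40 and 32.5, common SD 10.8, n = 54 and 46):

```python
# Two-sample t-test in scipy on illustrative data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sufficient = rng.normal(40, 10.8, 54)
deficient = rng.normal(32.5, 10.8, 46)

t, p = stats.ttest_ind(sufficient, deficient)   # pooled-variance t-test, 98 df
print(t, p)
```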
As a linear regression…
[SAS output: parameter estimates (Variable, Parameter Estimate, Standard Error, t Value, Pr > |t|).]
Sufficient vs. Deficient
Results…
Parameter Estimates
[SAS output: parameter estimates (Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|).]
Interpretation:
The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.
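A sketch of how this dummy coding looks with the statsmodels formula API (hypothetical data; the group means are invented). Treatment('sufficient') sets the sufficient group as the reference, so each fitted coefficient is that group's difference in mean DSST from the reference:

```python
# Dummy-coded group comparison as a linear regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
groups = np.repeat(["sufficient", "insufficient", "deficient"], 40)
means = {"sufficient": 40, "insufficient": 33, "deficient": 30}
df = pd.DataFrame({
    "group": groups,
    "dsst": [rng.normal(means[g], 10) for g in groups],
})

fit = smf.ols("dsst ~ C(group, Treatment('sufficient'))", data=df).fit()
print(fit.params)   # each coefficient = that group's gap from the reference
```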
Multivariate regression pitfalls
Multi-collinearity
Residual confounding
Overfitting
Multicollinearity
Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
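A small illustration with invented weight/height data: BMI is computed from weight, so the two predictors are nearly collinear, and their variance inflation factors (VIFs) rise well above 1, flagging the redundancy:

```python
# VIF demo: weight and BMI carry nearly the same information.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
height = rng.normal(1.75, 0.05, n)   # meters
weight = rng.normal(75, 12, n)       # kilograms
bmi = weight / height**2             # nearly a rescaled copy of weight

X = sm.add_constant(np.column_stack([weight, bmi]))
for i, name in enumerate(["const", "weight", "bmi"]):
    print(name, variance_inflation_factor(X, i))
```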
Mortality risks…
[Figure: mortality risk estimates from Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med.]
Overfitting
In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.
Overfitting: class data example
I asked SAS to automatically find predictors of optimism in our class dataset. Here’s the resulting linear regression model:
[SAS output: parameter estimates (Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F).]
Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!), and high ratings for Obama and high love of math are positively related to optimism (highly significant!).
If something seems too good to be true…
Clinton, univariate:
[SAS output: parameter estimates (Variable, Label, DF, Estimate, Standard Error, t Value, Pr > |t|).]
Obama, univariate:

Variable    Label      DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept  1    0.82107    2.43137          0.34      0.7389
obama       obama      1    0.87276    0.31973          2.73      0.0126

Compare with the multivariate result: p<.0001.
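The trap is easy to reproduce. In this sketch (pure noise, no real signal anywhere), screening many candidate predictors against a small sample still turns up “significant” ones by chance alone, mimicking what automated selection did above:

```python
# Why automated selection can look too good to be true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, k = 25, 50                        # small class-sized sample, many candidates
X = rng.normal(size=(n, k))
y = rng.normal(size=n)               # outcome unrelated to every predictor

pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(k)])
print((pvals < 0.05).sum(), "of", k, "noise predictors pass p < .05")
print("smallest p-value:", pvals.min())
```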
Binary outcome (e.g., high blood pressure, yes/no): logistic regression.
Model: ln(odds of high blood pressure) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0)
Gives odds ratios: tells you how much the odds of the outcome increase for every 1-unit increase in each predictor.
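A hedged sketch of this model in code (data and coefficients are invented; the variable names mirror the slide). smf.logit fits the model, and exponentiating the coefficients gives the odds ratio per 1-unit increase in each predictor:

```python
# Logistic regression sketch with made-up blood-pressure data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
n = 500
df = pd.DataFrame({
    "salt": rng.uniform(0, 5, n),        # tsp/day
    "age": rng.uniform(30, 80, n),       # years
    "smoker": rng.integers(0, 2, n),     # ever smoker: yes=1/no=0
})
logit_p = -6 + 0.3 * df.salt + 0.05 * df.age + 0.5 * df.smoker
df["high_bp"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

fit = smf.logit("high_bp ~ salt + age + smoker", data=df).fit()
print(np.exp(fit.params))   # odds ratios
```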
Continuous outcome (e.g., pain scale, cognitive function)

Independent observations:
Ttest: compares means between two independent groups
ANOVA: compares means between more than two independent groups
Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
Linear regression: multivariate regression technique used when the outcome is continuous

Correlated observations:
Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Non-parametric alternatives:
Wilcoxon sign-rank test: non-parametric alternative to the paired ttest
Wilcoxon sum-rank test (=Mann-Whitney U test): non-parametric alternative to the ttest
Kruskal-Wallis test: non-parametric alternative to ANOVA
Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient
Binary or categorical outcomes (proportions); HRP 259/HRP 261
Are the observations correlated?

Outcome: binary or categorical (e.g., fracture, yes/no)

Independent observations:
Chi-square test: compares proportions between two or more groups
Relative risks: odds ratios or risk ratios
Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5)
McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Time-to-event outcome (survival data); HRP 262
Are the observation groups independent or correlated?

Outcome: time-to-event (e.g., time to fracture)

Independent observations:
Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated observations:
n/a (already over time)

Modifications to Cox regression if proportional-hazards is violated:
Time-dependent predictors or time-dependent hazard ratios (tricky!)
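Not part of the original handout, but as a practical pointer: several of the tests in these tables have direct scipy.stats equivalents; the calls below (on small made-up samples) show the mapping:

```python
# Quick map from tests in the tables above to scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
a = rng.normal(0.0, 1, 30)
b = rng.normal(0.5, 1, 30)
table = [[10, 20], [15, 5]]                 # a 2x2 contingency table

print(stats.ttest_ind(a, b))                # ttest, independent groups
print(stats.ttest_rel(a, b))                # paired ttest, related groups
print(stats.mannwhitneyu(a, b))             # Wilcoxon sum-rank / Mann-Whitney U
print(stats.wilcoxon(a, b))                 # Wilcoxon sign-rank (paired)
print(stats.kruskal(a, b))                  # Kruskal-Wallis
print(stats.pearsonr(a, b))                 # Pearson correlation
print(stats.spearmanr(a, b))                # Spearman rank correlation
print(stats.chi2_contingency(table)[1])     # chi-square test p-value
print(stats.fisher_exact(table))            # Fisher's exact test (sparse cells)
```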