0% found this document useful (0 votes)

11 views

Linear Correlation and Linear Regression + Summary of Tests

Uploaded by

ghosh71

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views

Linear Correlation and Linear Regression + Summary of Tests

Uploaded by

ghosh71

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 77

Linear correlation and linear

regression + summary of tests

Recall: Covariance

 ( x  X )( y
i i Y )
cov ( x , y )  i 1
n 1
Interpreting Covariance
cov(X,Y) > 0 X and Y are positively correlated

cov(X,Y) < 0 X and Y are inversely correlated

cov(X,Y) = 0 X and Y are independent

Correlation coefficient
 Pearson’s Correlation Coefficient is standardized covariance (unitless):

cov ariance( x, y )
r
var x var y
Correlation
 Measures the relative strength of the linear
relationship between two variables
 Unit-less
 Ranges between –1 and 1
 The closer to –1, the stronger the negative linear relationship
 The closer to 1, the stronger the positive linear relationship
 The closer to 0, the weaker any positive linear relationship
Scatter Plots of Data with
Various Correlation Coefficients
Y Y Y

X X X
r = -1 r = -.6 r=0
Y
Y Y

X X X
r = +1 r = +.3 r=0
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Linear relationships Curvilinear relationships

Y Y

X X

Y Y

X X
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Strong relationships Weak relationships

Y Y

X X

Y Y

X X
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
No relationship

X
 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Calculating by hand…
n

 ( x  x )( y
i 1
i i  y)
cov ariance( x, y ) n 1
rˆ  
var x var y n n

 i
( x
i 1
 x ) 2
 i
( y
i 1
 y ) 2

n 1 n 1
Simpler calculation formula…
Numerator of
n covariance
 ( x  x )( y
i 1
i i  y)

rˆ  n 1 
n n SS xy
 (x  x)  ( y
i
2
i  y)2 rˆ 
i 1 i 1 SS x SS y
n 1 n 1
n

 ( x  x )( y
i i  y)
SS xy
Numerators of
variance
i 1

n n
SS x SS y
 (x  x)  ( y
i 1
i
2

i 1
i  y) 2
Distribution of the
correlation coefficient:

2
1 r
SE (rˆ) 
n2
The sample correlation coefficient follows a T-distribution
with n-2 degrees of freedom (since you have to estimate the
standard error).

*note, like a proportion, the variance of the correlation coefficient depends

on the correlation coefficient itselfsubstitute in estimated r
Linear regression

In correlation, the two variables are treated as equals. In regression, one

variable is considered independent (=predictor) variable (X) and the other the
dependent (=outcome) variable Y.
What is “Linear”?
 Remember this:
 Y=mX+B?

B
What’s Slope?

A slope of 2 means that every 1-unit change in X

yields a 2-unit change in Y.
Prediction
If you know something about X, this knowledge helps you
predict something about Y. (Sound familiar?…sound
like conditional probabilities?)
Regression equation…
Expected value of y at a given level of x=

E ( yi / xi )     xi
Predicted value for an
individual…
yi=  + *xi + random errori

Fixed – Follows a normal

exactly distribution
on the
line
Assumptions (or the fine print)
 Linear regression assumes that…
 1. The relationship between X and Y is linear
 2. Y is distributed normally at each value of X
 3. The variance of Y at every value of X is the
same (homogeneity of variances)
 4. The observations are independent
The standard error of Y given X is the average variability around
the regression line at any given value of X. It is assumed to be equal
at all values of X.

Sy/x

Sy/x
Sy/x
Sy/x
Sy/x

Sy/x
Regression Picture
yi
ŷi  xi  
C A

B
y A
B y
C
yi

*Least squares estimation

x gave us the line (β) that
n n n minimized C2
(y
i 1
i  y) 2
  ( yˆ
i 1
i  y) 2
  ( yˆ
i 1
i  y i ) 2
R2=SSreg/SStotal
A2
B 2
C 2

SStotal SSreg SSresidual

Total squared distance of Distance from regression line to naïve mean of y Variance around the regression line
observations from naïve mean Variability due to x (regression) Additional variability not explained by
of y x—what least squares method aims to
Total variation minimize
Recall example: cognitive
function and vitamin D
 Hypothetical data loosely based on [1];
cross-sectional study of 100 middle-
aged and older European men.
 Cognitive function is measured by the Digit
Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged
and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D

Mean= 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets
 I generated four hypothetical datasets,
with increasing TRUE slopes (between
vit D and DSST):
 0
 0.5 points per 10 nmol/L
 1.0 points per 10 nmol/L
 1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate
relationship
Dataset 4: moderate
relationship
The “Best fit” line

Regression
equation:
E(Yi) = 28 + 0*vit
Di (in 10 nmol/L)
The “Best fit” line

Note how the line is

a little deceptive; it
draws your eye,
making the
relationship appear
stronger than it
really is!

Regression
equation:
E(Yi) = 26 + 0.5*vit
Di (in 10 nmol/L)
The “Best fit” line

Regression equation:
E(Yi) = 22 + 1.0*vit
Di (in 10 nmol/L)
The “Best fit” line

Regression equation:
E(Yi) = 20 + 1.5*vit Di
(in 10 nmol/L)

Note: all the lines go

through the point
(63, 28)!
Estimating the intercept and
slope: least squares estimation
** Least Squares Estimation
A little calculus….
What are we trying to estimate? β, the slope, from

What’s the constraint? We are trying to minimize the squared distance (hence the “least squares”) between the
observations themselves and the predicted values , or (also called the “residuals”, or left-over unexplained
variability)

Differencei = yi – (βx + α) Differencei2 = (yi – (βx + α)) 2

Find the β that gives the minimum sum of the squared differences. How do you maximize a function? Take the
n n
d to zero;
 (y
derivative; set it equal and solve. Typical max/min problem from calculus….
(y i  ( xi   ))  2(
2
i  xi   )( xi ))
d i 1 i 1
n
2( 
i 1
2
( y i xi  x i  xi ))  0...

From here takes a little math trickery to solve for β…

Resulting formulas…

ˆ Cov( x, y )
Slope (beta coefficient) =

Var ( x)

Intercept= Calculate : ˆ  y - ˆx

Regression line always goes through the point: (x, y)

Relationship with correlation

SDx
rˆ  ̂
SD y
In correlation, the two variables are treated as equals. In regression, one variable is considered
independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.
Example: dataset 4
SDx = 33 nmol/L
SDy= 10 points
Cov(X,Y) = 163
points*nmol/L
Beta = 163/332 = 0.15
SS x
points per nmol/L
ˆ
SS y
= 1.5 points per 10
nmol/L

r = 163/(10*33) = 0.49
Or
r = 0.15 * (33/10) = 0.49
Significance testing…
Slope
Distribution of slope ~ Tn-2(β,s.e.( ))

ˆ

H0: β1 = 0 (no linear relationship)

H1: β1 0 (linear relationship does exist)

Tn-2=
ˆ  0
s.e.( ˆ )
Formula for the standard error of
beta (you will not have to calculate
by hand!):
n

(y
i 1
i  yˆ i ) 2

2
n2 sy / x
sˆ  
SS x SS x

n
where SSx   ( xi  x ) 2
i 1

and yˆ i  ˆ  ˆxi
Example: dataset 4
 Standard error (beta) = 0.03
 T98 = 0.15/0.03 = 5, p<.0001

 95% Confidence interval = 0.09 to 0.21

Residual Analysis: check
assumptions
ei  Yi  Yˆi
 The residual for observation i, ei, is the difference
between its observed and predicted value
 Check the assumptions of regression by examining the
residuals
 Examine for linearity assumption
 Examine for constant variance for all levels of X
(homoscedasticity)
 Evaluate normal distribution assumption
 Evaluate independence assumption

 Graphical Analysis of Residuals

 Can plot residuals vs. X
Predicted values…

yˆ i  20  1.5 xi
For Vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L):

yˆ i  20  1.5(9.5)  34
Residual =
observed - predicted

X=95
nmol/L
yi  48
34
yˆ i  34
yi  yˆ i  14
Residual Analysis for
Linearity
Y Y

x x
residuals

x residuals x


Not Linear
 Linear
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for
Homoscedasticity

Y Y

x x
residuals

x residuals x

Non-constant variance
 Constant variance

 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for
Independence

Not Independent
 Independent
residuals

residuals
X
residuals

 Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual plot, dataset 4
Multiple linear regression…
 What if age is a confounder here?
 Older men have lower vitamin D
 Older men have poorer cognition
 “Adjust” for age by putting age in the
model:
 DSST score = intercept + slope1xvitamin D
+ slope2 xage
2 predictors: age and vit D…
Different 3D view…
Fit a plane rather than a line…

On the plane, the

slope for vitamin
D is the same at
every age; thus,
the slope for
vitamin D
represents the
effect of vitamin
D when age is
held constant.
Equation of the “Best fit”
plane…
 DSST score = 53 + 0.0039xvitamin D
(in 10 nmol/L) - 0.46 xage (in years)

 P-value for vitamin D >>.05

 P-value for age <.0001

 Thus, relationship with vitamin D was

due to confounding by age!
Multiple Linear Regression
 More than one predictor…

E(y)=  + 1X + 2 W + 3 *Z…

Each regression coefficient is the amount of change

in the outcome variable that would be expected
per one-unit change of the predictor, if all other
variables in the model were held constant.

Functions of multivariate
analysis:
 Control for confounders
 Test for interactions between predictors
(effect modification)
 Improve predictions
A ttest is linear regression!
 Divide vitamin D into two groups:
 Insufficient vitamin D (<50 nmol/L)
 Sufficient vitamin D (>=50 nmol/L), reference
group
 We can evaluate these data with a ttest or a
linear regression…

40  32.5  7.5
T98   3.46; p  .0008
2 2
10.8 10.8

54 46
As a linear regression…

Intercept Slope represents

represents the the difference in
mean value in means between the
the sufficient groups. Difference
group. is significant.

Parameter ````````````````Standard
Variable Estimate Error t Value Pr > |t|

Intercept 40.07407 1.47511 27.17 <.0001

insuff -7.53060 2.17493 -3.46 0.0008
ANOVA is linear regression!
 Divide vitamin D into three groups:
 Deficient (<25 nmol/L)
 Insufficient (>=25 and <50 nmol/L)
 Sufficient (>=50 nmol/L), reference group

DSST=  (=value for sufficient) + insufficient*(1 if

insufficient) + 2 *(1 if deficient)
This is called “dummy coding”—where multiple binary
variables are created to represent being in each
category (or not) of a categorical variable
The picture…
Sufficient vs.
Insufficient

Sufficient vs.
Deficient
Results…
Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 40.07407 1.47817 27.11 <.0001

deficient 1 -9.87407 3.73950 -2.64 0.0096
insufficient 1 -6.87963 2.33719 -2.94 0.0041

 Interpretation:
 The deficient group has a mean DSST 9.87 points
lower than the reference (sufficient) group.
 The insufficient group has a mean DSST 6.87
points lower than the reference (sufficient) group.
Multivariate regression pitfalls
 Multi-collinearity
 Residual confounding
 Overfitting
Multicollinearity
 Multicollinearity arises when two variables that
measure the same thing or similar things (e.g.,
weight and BMI) are both included in a multiple
regression model; they will, in effect, cancel each
other out and generally destroy your model.

 Model building and diagnostics are tricky

business!
Residual confounding
 You cannot completely wipe out
confounding simply by adjusting for
variables in multiple regression unless
variables are measured with zero error
(which is usually impossible).
 Example: meat eating and mortality
Men who eat a lot of meat are
unhealthier for many reasons!

Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med
Mortality risks…

Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med
Overfitting
 In multivariate modeling, you can get
highly significant but meaningless
results if you put too many predictors in
the model.
 The model is fit perfectly to the quirks
of your particular sample, but has no
predictive ability in a new sample.
Overfitting: class data
example
 I asked SAS to automatically find
predictors of optimism in our class
dataset. Here’s the resulting linear
regression model:
Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 11.80175 2.98341 11.96067 15.65 0.0019

exercise -0.29106 0.09798 6.74569 8.83 0.0117
sleep -1.91592 0.39494 17.98818 23.53 0.0004
obama 1.73993 0.24352 39.01944 51.05 <.0001
Clinton -0.83128 0.17066 18.13489 23.73 0.0004
mathLove 0.45653 0.10668 13.99925 18.32 0.0011

Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly
significant!) and high ratings for Obama and high love of math are positively related to
optimism (highly significant!).
If something seems to good to
be true…
Clinton, univariate:
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|

Intercept Intercept 1 5.43688 2.13476 2.55 0.0188

Clinton Clinton 1 0.24973 0.27111 0.92 0.3675
Sleep, Univariate:
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|

Intercept Intercept 1 8.30817 4.36984 1.90 0.0711

Exercise, Univariate: sleep sleep 1 -0.14484 0.65451 -0.22 0.8270
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|

Intercept Intercept 1 6.65189 0.89153 7.46 <.0001

exercise exercise 1 0.19161 0.20709 0.93 0.3658
More univariate models…

Obama, Univariate:
Parameter Standard Compare
Variable Label DF Estimate Error t Value Pr > |t| with
multivariate
Intercept Intercept 1 0.82107 2.43137 0.34 0.7389 result;
obama obama 1 0.87276 0.31973 2.73 0.0126 p<.0001

Love of Math, univariate: Compare

Parameter Standard with
Variable Label DF Estimate Error t Value Pr > |t| multivariate
result;
Intercept Intercept 1 3.70270 1.25302 2.96 0.0076 p=.0011
mathLove mathLove 1 0.59459 0.19225 3.09 0.0055
Overfitting
Rule of thumb: You need at
least 10 subjects for each
additional predictor
variable in the multivariate
regression model.

Pure noise variables still produce good R2 values if the model is

overfitted. The distribution of R2 values from a series of
simulated regression models containing only noise variables.
(Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction
to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)
Other types of multivariate
regression
 Multiple linear regression is for normally
distributed outcomes

 Logistic regression is for binary outcomes

 Cox proportional hazards regression is used when

time-to-event is the outcome
Common multivariate regression models.

Example Appropriate Example equation What do the coefficients

Outcome outcome multivariate give you?
(dependent variable regression
variable) model

Continuous Blood Linear blood pressure (mmHg) = slopes—tells you how

pressure regression  + salt*salt consumption much the outcome
(tsp/day) + age*age (years) + variable increases for
smoker*ever smoker every 1-unit increase in
(yes=1/no=0) each predictor.

Binary High blood Logistic ln (odds of high blood pressure) = odds ratios—tells you
pressure regression  + salt*salt consumption how much the odds of the
(yes/no) (tsp/day) + age*age (years) + outcome increase for
smoker*ever smoker every 1-unit increase in
(yes=1/no=0) each predictor.

Time-to- Time-to- Cox ln (rate of death) = hazard ratios—tells you

event death regression  + salt*salt consumption how much the rate of the
(tsp/day) + age*age (years) + outcome increases for
smoker*ever smoker every 1-unit increase in
(yes=1/no=0) each predictor.
Overview of statistical tests

The following table gives the appropriate

choice of a statistical test or measure of
association for various types of data (outcome
variables and predictor variables) by study
design.
e.g., blood pressure= pounds + age + treatment (1/0)
Continuous outcome

Continuous predictors Binary predictor

Types of variables to be analyzed
Statistical procedure
Predictor variable/s Outcome variable or measure of association
Cross-sectional/case-control studies
Binary (two groups) Continuous T-test
Binary Ranks/ordinal Wilcoxon rank-sum test
Categorical (>2 groups) Continuous ANOVA
Continuous Continuous Simple linear regression
Multivariate
(categorical and Continuous Multiple linear regression
continuous)
Chi-square test (or Fisher’s
Categorical Categorical
exact)
Binary Binary Odds ratio, risk ratio
Binary Logistic regression
Multivariate
Cohort Studies/Clinical Trials
Binary Binary Risk ratio
Categorical Time-to-event Kaplan-Meier/ log-rank test
Cox-proportional hazards
Multivariate Time-to-event
regression, hazard ratio
Categorical Continuous Repeated measures ANOVA
Multivariate Continuous Mixed models; GEE modeling
Alternative summary: statistics
for various types of outcome data
Are the observations independent or
correlated?

Outcome Variable independent correlated Assumptions

Continuous Ttest Paired ttest Outcome is normally

distributed (important
ANOVA Repeated-measures ANOVA
(e.g. pain scale, for small samples).
Linear correlation Mixed models/GEE modeling Outcome and predictor
cognitive function)
Linear regression have a linear
relationship.

Binary or Difference in proportions McNemar’s test Chi-square test

assumes sufficient
Relative risks Conditional logistic
categorical regression
numbers in each cell
Chi-square test (>=5)
(e.g. fracture yes/no) Logistic regression GEE modeling

Time-to-event Kaplan-Meier statistics n/a Cox regression

assumes proportional
Continuous outcome (means);
HRP 259/HRP 262
Are the observations independent or correlated?
Outcome independent correlated Alternatives if the normality
Variable assumption is violated (and
small sample size):

Continuous Ttest: compares means Paired ttest: compares means Non-parametric statistics
(e.g. pain between two independent groups between two related groups (e.g., Wilcoxon sign-rank test:
the same subjects before and non-parametric alternative to the
scale, after)
ANOVA: compares means paired ttest
cognitive
function) between more than two
independent groups Repeated-measures Wilcoxon sum-rank test
ANOVA: compares changes (=Mann-Whitney U test): non-
Pearson’s correlation over time in the means of two or parametric alternative to the ttest
more groups (repeated
coefficient (linear measurements)
correlation): shows linear Kruskal-Wallis test: non-
correlation between two parametric alternative to ANOVA
continuous variables Mixed models/GEE
modeling: multivariate
regression techniques to compare Spearman rank correlation
Linear regression: changes over time between two or coefficient: non-parametric
multivariate regression technique more groups; gives rate of change alternative to Pearson’s correlation
used when the outcome is over time coefficient
Binary or categorical outcomes
(proportions); HRP 259/HRP 261
Are the observations correlated? Alternative to the chi-
Outcome independent correlated square test if sparse
Variable cells:

Binary or Chi-square test: McNemar’s chi-square test: Fisher’s exact test: compares
categorical compares proportions between compares binary outcome between proportions between independent
more than two groups correlated groups (e.g., before and groups when there are sparse data
(e.g. after) (some cells <5).
fracture,
yes/no) Relative risks: odds ratios
or risk ratios Conditional logistic McNemar’s exact test:
regression: multivariate compares proportions between
regression technique for a binary correlated groups when there are
Logistic regression: sparse data (some cells <5).
outcome when groups are correlated
multivariate technique used
(e.g., matched data)
when outcome is binary; gives
multivariate-adjusted odds
ratios GEE modeling: multivariate
regression technique for a binary
outcome when groups are correlated
(e.g., repeated measures)
Time-to-event outcome
(survival data); HRP 262
Are the observation groups independent or correlated? Modifications to
Outcome Cox regression
independent correlated
Variable if proportional-
hazards is
violated:

Time-to- Kaplan-Meier statistics: estimates survival functions for n/a (already over Time-dependent
event (e.g., each group (usually displayed graphically); compares survival time) predictors or time-
functions with log-rank test dependent hazard
time to
ratios (tricky!)
fracture)
Cox regression: Multivariate technique for time-to-event data;
gives multivariate-adjusted hazard ratios