Statistical Methods for Bioinformatics: Lecture 3
Agenda
High dimensional datasets are common in Bioinformatics
Example: Comparative Genome Hybridisation to identify copy number changes
High dimensional datasets are a challenge
The perennial trade-off: bias vs variance
Choose the Optimal Model
The Principle of Parsimony (Occam’s Razor)
An explanation is better if it is simple, e.g. models should be pared down until they are minimal and adequate.
Linear Model Selection and Regularization
Subset Selection
Identify a subset of all p predictors X that we believe to be related to the response Y, then fit the model using this subset.
E.g. best subset selection, forward and backward step-wise selection.
Shrinkage
Shrink the estimated coefficients towards zero.
This shrinkage reduces the variance. Some of the coefficients may shrink to exactly zero, so shrinkage methods can also perform variable selection.
E.g. Ridge regression and the Lasso.
Dimension Reduction (next class)
Project all p predictors into an M-dimensional space where M < p, then fit a linear regression model.
E.g. Principal Components Regression and Partial Least Squares.
Subset selection algorithm
In this approach, we fit a model (e.g. a linear regression) for each possible combination of the p predictors (see the sketch below):
1 For k = 1, 2, . . . , p:
1 Fit all $\binom{p}{k}$ models that contain exactly k predictors.
2 Pick the best among these, where best is defined as having the best performance, e.g. smallest RSS or largest $R^2$.
2 Select a single best model across sizes, taking the number of parameters into account: $C_p$, AIC, BIC, and adjusted $R^2$; cross-validated prediction error.
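A minimal Python sketch of the exhaustive search, assuming a NumPy design matrix X (n rows, p columns) and response vector y; the names X, y, and best_subset are illustrative, not from the slides:

```python
# Illustrative sketch of best subset selection (names are assumptions).
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, fit all C(p, k) OLS models and keep the
    subset with the smallest RSS; returns {k: (columns, rss)}."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if k not in best_per_size or rss < best_per_size[k][1]:
                best_per_size[k] = (cols, rss)
    return best_per_size
```

Comparing the size-k winners still requires a penalized criterion such as $C_p$ or cross-validation, since RSS always improves as predictors are added.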
Refresher: definitions
$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$
where $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares and $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares.
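A small sketch of these quantities, assuming y and y_hat are NumPy arrays of observed and fitted values:

```python
# RSS, TSS, and R^2 from observed y and fitted y_hat (assumed arrays).
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - rss / tss               # R^2 = 1 - RSS/TSS
```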
Subset Selection and Step-wise Selection
Forward Step-wise Selection
1 Start from an empty model
2 For k = 0, . . . , p − 1:
1 Consider all p − k models that augment the current predictor set with one additional predictor.
2 Choose the best according to a performance measure (e.g. RSS).
3 Compare the models of different sizes using $C_p$, AIC, BIC, and adjusted $R^2$; cross-validated prediction error.
Considerations (see the sketch after this list)
1 Considerably fewer models are fit ($1 + p(p+1)/2$ vs $2^p$).
2 Works well in practice, but may be sub-optimal due to correlation and interaction between variables:
1 the addition of a new variable may make already included variables “non-significant”;
2 an optimal pair or triple of predictors may be missed in the early phases by the greedy procedure.
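scikit-learn ships a greedy stepwise selector that can serve as a sketch of this procedure; note that it scores candidate additions by cross-validation rather than raw RSS, and the toy data and stopping size of 5 are assumptions:

```python
# Forward stepwise selection via scikit-learn (>= 0.24).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))       # toy data: n=100, p=10
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,   # assumed stopping size
    direction="forward",      # add one predictor per step
    cv=5,                     # candidates scored by 5-fold CV
)
sfs.fit(X, y)
print(sfs.get_support())      # boolean mask of the selected predictors
```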
Backward Step-wise Selection
1 Start from a full model with all predictors
2 For k = p, p − 1, . . . , 1:
1 Consider all k models that remove one predictor from the current predictor set.
2 Choose the best according to a performance measure (e.g. RSS).
3 Compare the models of different sizes using $C_p$, AIC, BIC, and adjusted $R^2$; cross-validated prediction error.
Considerations (see the sketch after this list)
1 Considerably fewer models are fit than with best subset selection ($1 + p(p+1)/2$ vs $2^p$).
2 Works well in practice and avoids missing successful combinations of variables.
Only possible when p < n (without further constraints and with Ordinary Least Squares).
There is no guarantee of finding the optimal solution.
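The backward variant is the same sketch with the direction flipped; this fragment reuses the toy X, y from the forward example above (where p < n holds, as required):

```python
# Backward stepwise selection: start full, drop one predictor per step.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs_back = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,   # assumed stopping size
    direction="backward",     # remove, rather than add, predictors
    cv=5,
).fit(X, y)                   # X, y as in the forward sketch above
```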
Predictor Set Size Penalized Performance Measures
$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$
where d is the number of predictors, n the number of observations, and $\hat{\sigma}^2$ an estimate of the error variance.
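A direct transcription of this formula; estimating $\hat{\sigma}^2$ from the full OLS fit is a common convention and an assumption here, not something stated on the slide:

```python
# Mallows' Cp as defined above; inputs are assumed precomputed.
def mallows_cp(rss, n, d, sigma2_hat):
    """Cp = (RSS + 2 * d * sigma^2_hat) / n."""
    return (rss + 2 * d * sigma2_hat) / n

# One common choice: estimate sigma^2 from the full model's residuals,
# sigma2_hat = rss_full / (n - p - 1), where p is the total predictor count.
```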
(Cross)-Validation
Credit data example (from the book)
Choose the Optimal Model
Shrinkage/regularization methods
Shrinkage methods
The Lasso minimizes:
$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$
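In scikit-learn this objective is available as Lasso; note that sklearn scales the RSS term by 1/(2n), so its alpha corresponds to λ only up to that factor. The toy data below is an assumption for illustration:

```python
# Lasso fit: some coefficients shrink exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] + rng.standard_normal(100)

lasso = Lasso(alpha=0.1)      # larger alpha => stronger shrinkage
lasso.fit(X, y)
print(lasso.coef_)            # several coefficients are exactly 0
```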
Shrinkage adds a penalty on coefficients
Figure Explanation
The strength of Ridge/Lasso vs Ordinary Least Squares
Ideally the penalty reduces variance at the cost of a small increase in bias.
In some cases variance can seriously hamper Least Squares fits. In the simulated dataset analysed below there are 45 predictors and 50 data points; variance is large when n is close to p. Least Squares corresponds to the right extreme of the right panel.
The shrinkage methods can work even when n < p.
The strength of Ridge/Lasso vs Ordinary Least Squares
In the simulated dataset analysed below there are 45 predictors and 50 data points, but only 2 of the predictors are associated with the response (a re-creation of this setting is sketched below).
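A hedged re-creation of the setting described on these two slides (45 predictors, 50 observations, 2 truly associated), comparing OLS and ridge coefficients; all names and the alpha value are assumptions:

```python
# n close to p: OLS coefficients are highly variable, ridge shrinks them.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n, p = 50, 45
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = 3.0                            # only 2 informative predictors
y = X @ beta + rng.standard_normal(n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)       # alpha chosen for illustration
print(np.abs(ols.coef_).mean())           # large, noisy coefficients
print(np.abs(ridge.coef_).mean())         # pulled toward zero
```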
Selecting the Tuning Parameter λ
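The standard approach is to pick λ by cross-validation over a grid of candidate values; a minimal sketch with LassoCV, on assumed toy data:

```python
# Choose the penalty strength by 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] + rng.standard_normal(100)

lasso_cv = LassoCV(cv=10).fit(X, y)  # searches an alpha grid internally
print(lasso_cv.alpha_)               # the CV-selected penalty
```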
Some nuts and bolts
No penalty is imposed on the intercept.
The penalty is formulated on the size of the coefficients, which implies that the scale of the variables is important! The variables should therefore be standardized to make them comparable:
$\tilde{x}_{i,j} = \frac{x_{i,j}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{i,j} - \bar{x}_j)^2}}$
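A direct sketch of this scaling in NumPy; note the formula divides by the (population) standard deviation without centering the numerator, so it differs from sklearn's StandardScaler, which also subtracts the column mean:

```python
# Standardize columns as in the formula above: divide each column by the
# root-mean-square deviation of that column from its mean.
import numpy as np

def standardize_columns(X):
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd                      # numerator is left uncentered
```

In practice, sklearn's Ridge and Lasso leave the intercept unpenalized via fit_intercept=True, so only the scaling of the predictors needs attention.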
Lasso vs ridge regression
A generalization of the penalty
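The slide content is lost in extraction; assuming it shows the standard q-norm family that interpolates between the lasso (q = 1) and ridge (q = 2), the penalized objective would read:

```latex
% Assumed reconstruction: the |beta_j|^q penalty family.
\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|^{q}
```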
Recap
To do:
Exercises
Exercises 6.8.1 and 6.8.2
Do Labs 6.5 and 6.6
Exercises 6.8.5, 6.8.8, and 6.8.10