Project in Statistics: Local Alignment and Linear Regression
1.1 Introduction
The purpose of this project is two-fold. First, it is a project about linear regression and
how to fit (and evaluate the fit of) a linear regression model using least squares estimation.
Second, the data you are going to consider comes from local alignment of proteins, and
you are thus going to get some insight into the statistical assessment of (local) alignments.
1.2 Formalities
This project is the first out of two projects in the course Statistics BI. The deadline for
handing in the project is December 21 at 12.00. The project must be solved in groups of
2-3 persons, and the complete solution must form a coherent project containing arguments,
formulas, estimates, graphs etc. to answer the questions, but computer code, data printouts
or the like will not be considered. The project is evaluated as passed/not-passed. In case
you do not pass, you have one week to redo the project for a second and final evaluation.
1.3 Background
The traditional way of assessing the significance of a local alignment of two amino acid
sequences is to compare the score of the optimal local alignment with the distribution
of the optimal score when locally aligning independent random amino acid sequences.
Usually, random is taken to mean iid sequences. The actual distribution of the optimal
local alignment score cannot be found explicitly, but one can rely on an approximation.
A general form of the approximation says that the distribution of the score, S, is given
(approximately) as
    S = β0 + β1 log(nm) + σ²ε    (1.1)
where ε is Gumbel distributed with mean 0 and variance 1, and n and m are the lengths
of the two amino acid sequences. The parameters β0, β1 ∈ R and σ² > 0 depend upon
the score matrix and gap penalties chosen for the local alignment algorithm and the
distribution of the sequences – thus if the sequences are assumed to be iid sequences, the
distribution is given by the amino acid probabilities.
You are going to investigate to what extent the model given by (1.1) is a good model and
whether least squares linear regression is appropriate for estimating the parameters.
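The model (1.1) is easy to simulate from, which can be useful as a sanity check later on. The sketch below is a minimal R example; the parameter values are made up, and the standardisation uses the standard Gumbel mean (Euler's constant, about 0.5772) and variance (π²/6).

    ## Minimal simulation sketch of model (1.1); parameter values are illustrative only.
    rgumbel_std <- function(k) {
      g <- -log(-log(runif(k)))          # standard Gumbel via inverse CDF
      (g - 0.5772157) / (pi / sqrt(6))   # shift and scale to mean 0, variance 1
    }
    simulate_scores <- function(n, m, beta0, beta1, sigma2) {
      beta0 + beta1 * log(n * m) + sigma2 * rgumbel_std(length(n))
    }
    n <- sample(100:1000, 200, replace = TRUE)   # hypothetical sequence lengths
    m <- sample(100:1000, 200, replace = TRUE)
    s <- simulate_scores(n, m, beta0 = -10, beta1 = 5, sigma2 = 2)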
1.4 Problems
The classical approximation states that the distribution function of the optimal local alignment score is approximately

    P(S ≤ x) ≈ exp(−Knm e^{−λx})    (1.2)

with K, λ > 0.
• Find the mean and variance of the distribution given by (1.2). Argue that the family
of distributions given by (1.1), in terms of the β0, β1, σ² parameterisation, contains
the family given by (1.2), and derive a representation of λ and K in terms of β0, β1
and σ².
For the rest of the project you will need the datasets aaseq1.fa and aaseq2.fa, each of
which contains 200 proteins. The proteins are chosen randomly from NCBI’s database of
Human protein sequences. In the following, when you are asked to locally align the proteins
from the two files pairwise, you need to align the first sequences from each file, the second
sequences from each file, etc. and end up with a total of 200 local alignment scores. For
the purpose of doing local alignment, you need to use an implementation of the Smith-
Waterman algorithm. Any implementation will do (as long as it is correct), but you can
choose to use the program la available for download on the homepage of the course.
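If you prefer to stay in R instead of using la, the following sketch shows one way to obtain the 200 pairwise local alignment scores. It assumes the Bioconductor package Biostrings is installed (in recent Bioconductor releases pairwiseAlignment lives in the companion package pwalign); check the sign convention for the gap penalties in your version.

    ## Sketch: pairwise local (Smith-Waterman) alignment of the two files in R.
    library(Biostrings)
    data(BLOSUM62)                                 # BLOSUM62 matrix shipped with Biostrings
    seqs1 <- readAAStringSet("aaseq1.fa")
    seqs2 <- readAAStringSet("aaseq2.fa")
    scores <- sapply(seq_along(seqs1), function(i)
      score(pairwiseAlignment(seqs1[[i]], seqs2[[i]], type = "local",
                              substitutionMatrix = BLOSUM62,
                              gapOpening = 12, gapExtension = 1)))
    aln <- data.frame(score = scores, n = width(seqs1), m = width(seqs2))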
• Estimate the parameters β0, β1, and σ² by least squares linear regression based on the
scores obtained when locally aligning the proteins from aaseq1.fa with aaseq2.fa using
BLOSUM62, gap open penalty 12 and gap extension penalty 1 (a sketch of the fit in R is
given after this list of questions).
• Discuss how well the model given by (1.1) fits the data.
• Compute the corresponding values of λ and K. Compare with standard values, taking
into account the uncertainty of the estimates.
for K, λ, γ > 0.
• Investigate if the introduction of the correction factor enhances the fit of the model.
The use of least squares linear regression is based upon certain assumptions.
• Discuss what kind of distributional assumptions we must make about the proteins
in the two files for actually using least squares linear regression.
• It would be tempting to consider just one of the files and align all proteins pairwise
within the file, yielding a total of 19900 different local alignment scores. How will
that necessarily violate the assumptions, and to what extent do you believe that the
violation is serious?
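As a hint for the estimation question above, the least squares fit itself is a one-liner once the scores and the sequence lengths are available. The sketch below assumes a data frame aln with columns score, n and m (hypothetical names, e.g. as produced by the alignment sketch above).

    ## Sketch of the regression fit for the model S = beta0 + beta1*log(nm) + noise.
    fit <- lm(score ~ log(n * m), data = aln)
    summary(fit)                     # estimates of beta0 and beta1 with standard errors
    sigma_hat <- summary(fit)$sigma  # residual standard error, the estimated noise scale
    plot(fit, which = 1)             # residuals vs fitted values, useful for the model-fit question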
Linear Regression
Suppose that X1 , . . . , Xn are random variables taking values in R, and that they are given
by
Xi = g0 (yi ) + εi
with ε1, . . . , εn being iid with mean 0 and variance σ², y1, . . . , yn being n (known) variables
with values in E, and that g0 : E → R is some function. Then

    E(Xi) = g0(yi),

thus the function g0 gives the mean of Xi in terms of the additional variable yi. We call
the yi ’s covariates, regressors or predictors. One can also encounter the terminology that
yi is the independent variable and Xi is the dependent variable. The objective is then,
based on a realisation, x1 , . . . , xn , of the random variables, to infer the functional relation
g0 between the covariates y1 , . . . , yn and the means of the variables X1 , . . . , Xn .
The method we consider here is that of least squares estimation. For any function g we
can compute the sum of squares
    ss(g) = Σ_{i=1}^{n} (xi − g(yi))²,
which measures the distance from the observations to the mean of the corresponding
random variable – provided that g = g0 . We don’t know g0 and as we vary g the resulting
ss(g) changes too. If we look at the theoretical counterpart of ss(g),

    SS(g) = Σ_{i=1}^{n} (Xi − g(yi))²,

its expectation becomes

    E(SS(g)) = nσ² + Σ_{i=1}^{n} (g0(yi) − g(yi))².

The first term does not depend upon g, and the second term is minimised (equalling 0)
whenever g(yi) = g0(yi) for i = 1, . . . , n. Based on this result we will try to find a g that
minimises the sum of squares ss(g) as an estimate of the true, unknown g0.
The reader may have noticed that any choice of g for which g(yi ) = xi actually minimises
ss(g) – and, moreover, gives that the sum of squares equals 0. So if we are allowed to choose
an arbitrary function g, we can always find a function that fits the observations exactly
in the sense of making ss(g) = 0. This is not necessarily desirable, and we talk about
overfitting of the model to the data. After all, the expected value of the sum of squares is
at least going to be nσ². Most often we don’t want the function g to be arbitrary, but we
have instead a fixed class of functions in mind that we want to minimise ss(g) over.
If f : E → R is a known function we can consider the set of functions given by
    g(y) = β0 + β1 f(y)

with parameters β0, β1 ∈ R.

Example 2.1.1. Consider an experiment where an enzyme is added in a known concentration
to a biochemical process and we measure the
concentration of the product after 1 minute. Thus with X the concentration of the product
we will consider
g(y) = β0 + β1 y
where y is the concentration of the enzyme and β0 , β1 ∈ R. Here the interpretation of β0
is the rate by which the process occurs without the enzyme, and β1 is the increase per
unit enzyme of the rate. Hopefully, the parameter β1 is positive – otherwise the addition
of the enzyme would make the process slow down instead. We seek to minimise
    ss(g) = ss(β0, β1) = Σ_{i=1}^{n} (xi − β0 − β1 yi)²

as a function of β0, β1 ∈ R. If we add two enzymes, in concentrations y and z, we may
instead consider the additive model

    g(y, z) = β0 + β1 y + β2 z

and minimise the corresponding sum of squares as a function of β0, β1, β2 ∈ R.
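A minimal R sketch of the two fits, assuming the measurements are collected in a hypothetical data frame enz with columns x (product concentration) and y, z (enzyme concentrations):

    ## Sketch of the least squares fits in the enzyme example; enz and its columns are hypothetical.
    fit1 <- lm(x ~ y, data = enz)       # one enzyme:  g(y)   = beta0 + beta1*y
    fit2 <- lm(x ~ y + z, data = enz)   # two enzymes: g(y,z) = beta0 + beta1*y + beta2*z
    coef(fit2)                          # estimated beta-parameters
    plot(fit2, which = 1)               # residual plot used for model control (cf. Figure 2.2)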
It is a serious model assumption – not an innocent one – to assume that the effect of the
two enzymes is additive. It means that a change in the concentration of enzyme 2 affects the
process in exactly the same way independently of the concentration of the other enzyme. It
is easy to imagine situations where the two enzymes somehow interfere with each other so
that mixing the enzymes can cause the process to become slower than just adding one of
the enzymes. One can also imagine situations where there is a synergy effect so that the
total speed-up is more than the sum of the individual contributions. Such model assumptions
cry out for model control, that is, for methods to investigate whether the model assumptions
are valid.
Example 2.1.2. Let the covariates be time points t1, . . . , tn; for instance, with time measured
in hours we can have n = 48 and t1 = 01.00, t2 = 02.00, . . . , tn = 48.00. Consider
the measurement of the body temperature of an animal as a function of time. We believe
that it oscillates with one complete cycle every 24 hours, that is

    g0(t) = β0 + β1 cos(2πt/24).

In Figure 2.1 we see a simulated example where β0 = 37, β1 = 0.5 and the εi’s are iid
N(0, 0.2). The figure also shows the fitted function ĝ from the dataset consisting of 48
observations using least squares linear regression. In this case β̂0 = 37.0 and β̂1 = 0.46.
The variance is estimated as σ̂² = 0.22.
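The simulated data in Figure 2.1 (left) can be reproduced in spirit with a few lines of R. The exact noise level and random seed behind the figure are not known, so the values below are only illustrative, reading the N(0, 0.2) in the text as variance 0.2.

    ## Simulation sketch of Example 2.1.2 (illustrative values only).
    t <- 1:48
    x <- 37 + 0.5 * cos(2 * pi * t / 24) + rnorm(48, mean = 0, sd = sqrt(0.2))
    fit <- lm(x ~ cos(2 * pi * t / 24))
    coef(fit)                          # estimates of beta0 and beta1
    plot(t, x); lines(t, fitted(fit))  # data and fitted curve, as in Figure 2.1 (left)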
Figure 2.1: An example dataset generated by the model from Example 2.1.2 (left) with
the estimated ĝ. The other dataset (right) was generated using a function g that does not
fit into the framework of Example 2.1.2.
More generally, with known functions f1, . . . , fd we can consider functions of the form

    g(y) = β0 + β1 f1(y1) + . . . + βd fd(yd),

and the corresponding sum of squares

    ss(β0, β1, . . . , βd) = Σ_{i=1}^{n} (xi − β0 − Σ_{k=1}^{d} βk fk(yik))²

can be minimised analytically. This essentially means that by introducing some mathematics
(linear algebra and matrices) we can write down a closed form solution for the
β-parameters in terms of the xi’s and the fk(yik)’s that minimise the sum of squares.
However, what is more interesting from a practical point of view is that the minimisation
problem is very well behaved in the sense that in every concrete case one can easily tell
if there is a unique solution¹, and in that case the solution can be computed extremely
efficiently.
Once the estimates β̂0, β̂1, . . . , β̂d that minimise the sum of squares have been found, we
have the fitted (mean) values ĝ(yi) of the Xi’s and the residuals

    xi − ĝ(yi), i = 1, . . . , n.
¹ The requirement is that the vectors (fk(yk1), . . . , fk(ykn)) for k = 1, . . . , d together with (1, 1, . . . , 1)
are linearly independent.
That is, the residuals are the observations minus the fitted values. The variance of the
ε-variables, σ², is estimated from the residuals by

    σ̂² = (1/(n − d − 1)) Σ_{i=1}^{n} (xi − ĝ(yi))² = ss(ĝ)/(n − d − 1),    (2.2)

and the estimate of the standard error is σ̂ = √σ̂².
R Box 2.2.1 (Linear regression). The function lm can be used to perform all
sorts of least squares linear regression estimation in R. The two parameters you
need to specify when calling lm are a formula and (optionally) a data frame containing
the data and covariates. If mydata is a data frame containing two columns
named x and y, we can call

    mylm <- lm(x ~ y, data = mydata)

to fit the parameters (etc.) and assign the resulting lm object to mylm.
By summary(mylm) you get a summary of the information contained in the
lm object. If you just want the estimated parameters, they are obtained by
mylm$coefficients. By plot(mylm,which=1) you get a plot of the residuals
vs. the fitted values. The capabilities of lm and a complete description of the resulting
lm object are beyond the scope of this R box. You are encouraged to consult
the help pages, help(lm), for more information. You can find information about
the formula specification of the model by help(formula).
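To connect the lm output to the formulas above, the sketch below computes the residuals and the variance estimate (2.2) by hand and compares with what lm reports (mydata, x and y as in the R box).

    ## Sketch: residuals and the variance estimate (2.2) computed by hand.
    mylm <- lm(x ~ y, data = mydata)
    res  <- residuals(mylm)                            # x_i - g-hat(y_i)
    d    <- length(coef(mylm)) - 1                     # number of f_k terms (here d = 1)
    sigma2_hat <- sum(res^2) / (nrow(mydata) - d - 1)  # formula (2.2)
    all.equal(sqrt(sigma2_hat), summary(mylm)$sigma)   # agrees with lm's residual standard error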
To have any faith in the fitted model, we need to somehow justify that it actually fits the
observed data, e.g. that the functional form of g is appropriate and that the ε-variables
are independent and identically distributed. Moreover, we are sometimes in a position
where we also assume a known distribution for the ε-variables, for instance that the
variables are normally distributed with mean 0 and variance σ². In that case we would
also like to check if such an assumption is fulfilled. Such model control can be carried
out in more or less formal ways by considering and analysing the residuals.
A residual plot is a plot of the residuals against the fitted values. Since
    εi = Xi − g0(yi)
we see that the residuals are approximations of the actual (but unobserved) εi ’s. Thus the
residuals should essentially resemble the behaviour of iid variables. What we look for in the
residual plot are systematic deviations from the iid assumption. If the residuals for instance
are mostly positive for small fitted values and negative for large fitted values, we get an
indication that the fitted g does not capture the means of the Xi variables sufficiently
well. Other deviations can indicate dependence between the ε-variables. Another thing to
look for is whether the residuals are more spread out for e.g. large fitted values than for
small fitted values. This indicates that the variance is not constant but actually depends
upon the covariate.

Figure 2.2: A residual plot corresponding to Example 2.1.1, but where the functional form
of g when adding both enzymes is g(y, z) = exp(−0.1z)y + exp(−0.1y)z. Thus the enzymes
cancel out each other’s effect. The result is clear from the residual plot: too many negative
residuals for large and small fitted values of g, and too many positive in between.
If we also assume that εi is normally distributed, we can check this assumption by com-
paring the empirical distribution of the residuals with the normal distribution, e.g. by a
QQ-plot.
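In R this check is a two-liner (mylm as in the R box above):

    ## Sketch: normal QQ-plot of the residuals to check the normality assumption.
    qqnorm(residuals(mylm))
    qqline(residuals(mylm))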
The estimators β̂0, β̂1, . . . , β̂d, and σ̂² are all transformations of the observable random
variables X1, . . . , Xn. The distribution of the estimators, that is, the distribution of the trans-
formation of X1, . . . , Xn given by the estimators, tells us something about how good the
distribution of the εi ’s. It is in general not possible to find the distribution of the estima-
tors, but if εi ∼ N (0, σ 2 ) then it is actually possible.
If εi ∼ N(0, σ²) it holds that
• the d + 1 dimensional vector (β̂0, β̂1, . . . , β̂d) follows a (multivariate) normal distribution, and
• the vector (β̂0, β̂1, . . . , β̂d) and σ̂² are independent.
Figure 2.3: Continuing Example 2.1.2 we see the corresponding residual plot (left) for the
dataset that was generated using the linear regression model. The residual plot for the
dataset not from the model (right) indicates that the fit is not good.
We will not introduce the formulas for the parameters entering the normal distribution of
(β̂0, β̂1, . . . , β̂d) here, but just remark that the marginal distribution of β̂k is N(βk0, γk²σ²)
where γk² is a function of the (known) fk(yik)’s.
These distributional results only hold if ε ∼ N(0, σ²), but one of the more remarkable
results from probability theory gives that the results are approximately true if n is large
no matter what the distribution of the εi’s is – as long as it has a finite variance² and
they are independent. This result allows us to use the distributions above in general to
assess the uncertainty of the estimated parameters provided that n is sufficiently large –
whatever that means.
If, for instance, the covariate yk from the covariate vector y = (y1, . . . , yd) has no effect
on the mean value of the X’s, then the true value of βk is βk0 = 0. This means that the
chance of getting a value of the estimate with |β̂k| > z can be computed from the normal
distribution (using symmetry of the normal distribution) by
    P(|β̂k| > z) = 2P(β̂k > z) = (2/(√(2π) γk σ)) ∫_z^∞ exp(−x²/(2γk²σ²)) dx.
For practical purposes we approximate the unknown standard error σ by the estimated
value σ̂ = √σ̂². As a rule of thumb, if βk0 = 0 there is approximately a 32% chance that
² If Vε = ∞ one enters a completely different ball game.
|β̂k| > γk σ̂, approximately a 5% chance that |β̂k| > 2γk σ̂, and approximately a 0.3% chance
that |β̂k| > 3γk σ̂. Accurate computation of the probabilities is, however, easily based on
the distribution function of the normal distribution.
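The rule-of-thumb probabilities come directly from the normal distribution function, as the short R sketch below shows; here se_k plays the role of γkσ̂ and would in practice be read off the standard error column of summary(mylm)$coefficients.

    ## Sketch: normal tail probabilities P(|beta_k-hat| > z) when the true beta_k0 = 0.
    p_exceed <- function(z, se_k) 2 * pnorm(z, mean = 0, sd = se_k, lower.tail = FALSE)
    p_exceed(1, se_k = 1)   # ~0.32, i.e. exceeding gamma_k * sigma-hat
    p_exceed(2, se_k = 1)   # ~0.046
    p_exceed(3, se_k = 1)   # ~0.003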
If ε ∼ N(0, σ²) even more is true, because then the distribution of

    (β̂k − βk0)/(γk σ̂)

is known. It is a T-distribution with n − d − 1 degrees of freedom. One would therefore prefer
using the T -distribution instead of the normal distribution to compute e.g. the probability
P(|β̂k| > γk σ̂) given that βk0 = 0. The feature of the T-distribution that distinguishes it
from the normal distribution is that it is more spread out for small n. With n − d − 1 = 5
the probability P(|β̂k| > 2γk σ̂) evaluated using the T-distribution is close to 10% instead
of 5%, but with n − d − 1 = 50 the probability has dropped to about 5.1% for the
T-distribution.
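The comparison between the T-distribution and the normal distribution can be verified directly in R:

    ## Sketch: the same two-sided tail probability under the T- and the normal distribution.
    2 * pt(2, df = 5, lower.tail = FALSE)    # ~0.10  (n - d - 1 = 5)
    2 * pt(2, df = 50, lower.tail = FALSE)   # ~0.051 (n - d - 1 = 50)
    2 * pnorm(2, lower.tail = FALSE)         # ~0.046 (normal approximation)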
There are, however, three provisos before drawing any serious conclusions based on the
marginal distributional results above. The first one is that if the model is wrong, then the
results need not hold, and the conclusions drawn may be meaningless or downright wrong.
We may conclude that βk = 0 and thus that the covariate yk does not affect the mean, but
if the model is wrong and the covariate actually affects the mean in a non-additive way for
instance, the conclusion is meaningless and based upon wrong initial assumptions. There-
fore it is compulsory to check that the model assumptions are fulfilled – to the extent it is
possible of course. Second, if the εi’s are not iid normally distributed,
the distribution of the estimators is only approximate, and if the approximation is bad, the
conclusions may again be wrong. The approximations get better with more observations,
but it is difficult to give general guidelines for how many observations we need. Finally,
the considerations presented above are all marginal. This means that we consider only the
distribution of each of the β̂k estimators one at a time.