Project in Statistics: Local Alignment and Linear Regression
1.1 Introduction
The purpose of this project is two-fold. First, it is a project about linear regression and
how to fit (and evaluate the fit of) a linear regression model using least squares estimation.
Second, the data you are going to consider comes from local alignment of proteins, and
you are thus going to get some insight into the statistical assessment of (local) alignments.
1.2 Formalities
This project is the first out of two projects in the course Statistics BI. The deadline for
handing in the project is December 21 at 12.00. The project must be solved in groups of
2-3 persons, and the complete solution must form a coherent project containing arguments,
formulas, estimates, graphs etc. to answer the questions, but computer code, data printouts
or the like will not be considered. The project is evaluated as passed/not-passed. In case
you do not pass, you have one week to redo the project for a second and final evaluation.
1.3 Background
The traditional way of assessing the significance of a local alignment of two amino acid
sequences is to compare the score of the optimal local alignment with the distribution
of the optimal score when locally aligning independent random amino acid sequences.
Usually, random is taken to mean iid sequences. The actual distribution of the optimal
local alignment score cannot be found explicitly, but one can rely on an approximation.
A general form of the approximation says that the distribution of the score, S, is given
(approximately) as
    S = β0 + β1 log(nm) + σ²ε    (1.1)
where ε is Gumbel distributed with mean 0 and variance 1, and n and m are the lengths
of the two amino acid sequences. The parameters β0, β1 ∈ R and σ² > 0 depend upon
the score matrix and gap penalties chosen for the local alignment algorithm and the
distribution of the sequences – thus if the sequences are assumed to be iid sequences, the
distribution is given by the amino acid probabilities.
You are going to investigate to what extent the model given by (1.1) is a good model and
whether least squares linear regression is appropriate for estimating the parameters.
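The model (1.1) is easy to simulate from, which can be useful as a sanity check later on. The sketch below is a minimal R example; the parameter values are made up, and the standardisation uses the standard Gumbel mean (Euler's constant, about 0.5772) and variance (π²/6).

    ## Minimal simulation sketch of model (1.1); parameter values are illustrative only.
    rgumbel_std <- function(k) {
      g <- -log(-log(runif(k)))          # standard Gumbel via inverse CDF
      (g - 0.5772157) / (pi / sqrt(6))   # shift and scale to mean 0, variance 1
    }
    simulate_scores <- function(n, m, beta0, beta1, sigma2) {
      beta0 + beta1 * log(n * m) + sigma2 * rgumbel_std(length(n))
    }
    n <- sample(100:1000, 200, replace = TRUE)   # hypothetical sequence lengths
    m <- sample(100:1000, 200, replace = TRUE)
    s <- simulate_scores(n, m, beta0 = -10, beta1 = 5, sigma2 = 2)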
1.4 Problems
The classical approximation states that the distribution function of the optimal local alignment score is approximately

    P(S ≤ x) ≈ exp(−Knm e^{−λx})    (1.2)

with K, λ > 0.
• Find the mean and variance of the distribution given by (1.2). Argue that the family
of distributions given by (1.1), in terms of the β0, β1, σ² parameterisation, contains
the family given by (1.2), and derive a representation of λ and K in terms of β0, β1
and σ².
For the rest of the project you will need the datasets aaseq1.fa and aaseq2.fa, each of
which contains 200 proteins. The proteins are chosen randomly from NCBI’s database of
Human protein sequences. In the following, when you are asked to locally align the proteins
from the two files pairwise, you need to align the first sequences from each file, the second
sequences from each file, etc. and end up with a total of 200 local alignment scores. For
the purpose of doing local alignment, you need to use an implementation of the Smith-
Waterman algorithm. Any implementation will do (as long as it is correct), but you can
choose to use the program la available for download on the homepage of the course.
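If you prefer to stay in R instead of using la, the following sketch shows one way to obtain the 200 pairwise local alignment scores. It assumes the Bioconductor package Biostrings is installed (in recent Bioconductor releases pairwiseAlignment lives in the companion package pwalign); check the sign convention for the gap penalties in your version.

    ## Sketch: pairwise local (Smith-Waterman) alignment of the two files in R.
    library(Biostrings)
    data(BLOSUM62)                                 # BLOSUM62 matrix shipped with Biostrings
    seqs1 <- readAAStringSet("aaseq1.fa")
    seqs2 <- readAAStringSet("aaseq2.fa")
    scores <- sapply(seq_along(seqs1), function(i)
      score(pairwiseAlignment(seqs1[[i]], seqs2[[i]], type = "local",
                              substitutionMatrix = BLOSUM62,
                              gapOpening = 12, gapExtension = 1)))
    aln <- data.frame(score = scores, n = width(seqs1), m = width(seqs2))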
• Estimate the parameters β0, β1, and σ² by least squares linear regression based on the
scores obtained when locally aligning the proteins from aaseq1.fa with aaseq2.fa using
BLOSUM62, gap open penalty 12 and gap extension penalty 1 (a sketch of the fit in R is
given after this list of questions).
• Discuss how well the model given by (1.1) fits the data.
• Compute the corresponding values of λ and K. Compare with standard values, taking
into account the uncertainty of the estimates.
for K, λ, γ > 0.
• Investigate if the introduction of the correction factor enhances the fit of the model.
The use of least squares linear regression is based upon certain assumptions.
• Discuss what kind of distributional assumptions we must make about the proteins
in the two files for actually using least squares linear regression.
• It would be tempting to consider just one of the files and align all proteins pairwise
within the file, yielding a total of 19900 different local alignment scores. How will
that necessarily violate the assumptions, and to what extent do you believe that the
violation is serious?
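As a hint for the estimation question above, the least squares fit itself is a one-liner once the scores and the sequence lengths are available. The sketch below assumes a data frame aln with columns score, n and m (hypothetical names, e.g. as produced by the alignment sketch above).

    ## Sketch of the regression fit for the model S = beta0 + beta1*log(nm) + noise.
    fit <- lm(score ~ log(n * m), data = aln)
    summary(fit)                     # estimates of beta0 and beta1 with standard errors
    sigma_hat <- summary(fit)$sigma  # residual standard error, the estimated noise scale
    plot(fit, which = 1)             # residuals vs fitted values, useful for the model-fit question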
Linear Regression
Suppose that X1 , . . . , Xn are random variables taking values in R, and that they are given
by
Xi = g0 (yi ) + εi
with ε1, . . . , εn being iid with mean 0 and variance σ², y1, . . . , yn being n (known) variables
with values in E, and that g0 : E → R is some function. Then

    E(Xi) = g0(yi),

thus the function g0 gives the mean of Xi in terms of the additional variable yi. We call
the yi ’s covariates, regressors or predictors. One can also encounter the terminology that
yi is the independent variable and Xi is the dependent variable. The objective is then,
based on a realisation, x1 , . . . , xn , of the random variables, to infer the functional relation
g0 between the covariates y1 , . . . , yn and the means of the variables X1 , . . . , Xn .
The method we consider here is that of least squares estimation. For any function g we
can compute the sum of squares
    ss(g) = Σ_{i=1}^{n} (xi − g(yi))²,
which measures the distance from the observations to the mean of the corresponding
random variable – provided that g = g0 . We don’t know g0 and as we vary g the resulting
ss(g) changes too. If we look at the theoretical counterpart of ss(g),

    SS(g) = Σ_{i=1}^{n} (Xi − g(yi))²,

its expectation becomes

    E(SS(g)) = nσ² + Σ_{i=1}^{n} (g0(yi) − g(yi))².

The first term does not depend upon g, and the second term is minimised (equalling 0)
whenever g(yi) = g0(yi) for i = 1, . . . , n. Based on this result we will try to find a g that
minimises the sum of squares ss(g) as an estimate of the true, unknown g0.
The reader may have noticed that any choice of g for which g(yi ) = xi actually minimises
ss(g) – and, moreover, gives that the sum of squares equals 0. So if we are allowed to choose
an arbitrary function g, we can always find a function that fits the observations exactly
in the sense of making ss(g) = 0. This is not necessarily desirable, and we talk about
overfitting of the model to the data. After all, the expected value of the sum of squares is
at least going to be nσ². Most often we don’t want the function g to be arbitrary, but we
have instead a fixed class of functions in mind that we want to minimise ss(g) over.
If f : E → R is a known function we can consider the set of functions given by
    g(y) = β0 + β1 f(y)

with parameters β0, β1 ∈ R.

Example 2.1.1. Consider an experiment where an enzyme is added in a known concentration
to a biochemical process and we measure the
concentration of the product after 1 minute. Thus with X the concentration of the product
we will consider
g(y) = β0 + β1 y
where y is the concentration of the enzyme and β0 , β1 ∈ R. Here the interpretation of β0
is the rate by which the process occurs without the enzyme, and β1 is the increase per
unit enzyme of the rate. Hopefully, the parameter β1 is positive – otherwise the addition
of the enzyme would make the process slow down instead. We seek to minimise
    ss(g) = ss(β0, β1) = Σ_{i=1}^{n} (xi − β0 − β1 yi)²

as a function of β0, β1 ∈ R. If we add two enzymes, in concentrations y and z, we may
instead consider the additive model

    g(y, z) = β0 + β1 y + β2 z

and minimise the corresponding sum of squares as a function of β0, β1, β2 ∈ R.
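A minimal R sketch of the two fits, assuming the measurements are collected in a hypothetical data frame enz with columns x (product concentration) and y, z (enzyme concentrations):

    ## Sketch of the least squares fits in the enzyme example; enz and its columns are hypothetical.
    fit1 <- lm(x ~ y, data = enz)       # one enzyme:  g(y)   = beta0 + beta1*y
    fit2 <- lm(x ~ y + z, data = enz)   # two enzymes: g(y,z) = beta0 + beta1*y + beta2*z
    coef(fit2)                          # estimated beta-parameters
    plot(fit2, which = 1)               # residual plot used for model control (cf. Figure 2.2)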
It is a serious model assumption – not an innocent one – to assume that the effect of the
two enzymes is additive. It means that a change in the concentration of enzyme 2 affects the
process in exactly the same way independently of the concentration of the other enzyme. It
is easy to imagine situations where the two enzymes somehow interfere with each other so
that mixing the enzymes can cause the process to become slower than just adding one of
the enzymes. One can also imagine situations where there is a synergy effect so that the
total speed-up is more than the sum of the individual contributions. Such model assumptions
cry out for model control, that is, for methods to investigate whether the model assumptions
are valid.
Example 2.1.2. Let the covariates be time points t1, . . . , tn; for instance, with time measured
in hours we can have n = 48 and t1 = 01.00, t2 = 02.00, . . . , tn = 48.00. Consider
the measurement of the body temperature of an animal as a function of time. We believe
that it oscillates with one complete cycle every 24 hours, that is

    g0(t) = β0 + β1 cos(2πt/24).

In Figure 2.1 we see a simulated example where β0 = 37, β1 = 0.5 and the εi’s are iid
N(0, 0.2). The figure also shows the fitted function ĝ from the dataset consisting of 48
observations using least squares linear regression. In this case β̂0 = 37.0 and β̂1 = 0.46.
The variance is estimated as σ̂² = 0.22.
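The simulated data in Figure 2.1 (left) can be reproduced in spirit with a few lines of R. The exact noise level and random seed behind the figure are not known, so the values below are only illustrative, reading the N(0, 0.2) in the text as variance 0.2.

    ## Simulation sketch of Example 2.1.2 (illustrative values only).
    t <- 1:48
    x <- 37 + 0.5 * cos(2 * pi * t / 24) + rnorm(48, mean = 0, sd = sqrt(0.2))
    fit <- lm(x ~ cos(2 * pi * t / 24))
    coef(fit)                          # estimates of beta0 and beta1
    plot(t, x); lines(t, fitted(fit))  # data and fitted curve, as in Figure 2.1 (left)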
Figure 2.1: An example dataset generated by the model from Example 2.1.2 (left) with
the estimated ĝ. The other dataset (right) was generated using a function g that does not
fit into the framework of Example 2.1.2.
More generally, with known functions f1, . . . , fd we can consider functions of the form

    g(y) = β0 + β1 f1(y1) + . . . + βd fd(yd),

and the corresponding sum of squares

    ss(β0, β1, . . . , βd) = Σ_{i=1}^{n} (xi − β0 − Σ_{k=1}^{d} βk fk(yik))²

can be minimised analytically. This essentially means that by introducing some mathematics
(linear algebra and matrices) we can write down a closed form solution for the
β-parameters in terms of the xi’s and the fk(yik)’s that minimise the sum of squares.
However, what is more interesting from a practical point of view is that the minimisation
problem is very well behaved in the sense that in every concrete case one can easily tell
if there is a unique solution¹, and in that case the solution can be computed extremely
efficiently.
Once the estimates β̂0, β̂1, . . . , β̂d that minimise the sum of squares have been found, we
have the fitted (mean) values ĝ(yi) of the Xi’s and the residuals

    xi − ĝ(yi), i = 1, . . . , n.
¹ The requirement is that the vectors (fk(yk1), . . . , fk(ykn)) for k = 1, . . . , d together with (1, 1, . . . , 1)
are linearly independent.
That is, the residuals are the observations minus the fitted values. The variance of the
ε-variables, σ², is estimated from the residuals by

    σ̂² = (1/(n − d − 1)) Σ_{i=1}^{n} (xi − ĝ(yi))² = ss(ĝ)/(n − d − 1),    (2.2)

and the estimate of the standard error is σ̂ = √σ̂².
R Box 2.2.1 (Linear regression). The function lm can be used to perform all
sorts of least squares linear regression estimation in R. The two parameters you
need to specify when calling lm are a formula and (optionally) a data frame containing
the data and covariates. If mydata is a data frame containing two columns
named x and y, we can call

    mylm <- lm(x ~ y, data = mydata)

to fit the parameters (etc.) and assign the resulting lm object to mylm.
By summary(mylm) you get a summary of the information contained in the
lm object. If you just want the estimated parameters, they are obtained by
mylm$coefficients. By plot(mylm,which=1) you get a plot of the residuals
vs. the fitted values. The capabilities of lm and a complete description of the resulting
lm object are beyond the scope of this R box. You are encouraged to consult
the help pages, help(lm), for more information. You can find information about
the formula specification of the model by help(formula).
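To connect the lm output to the formulas above, the sketch below computes the residuals and the variance estimate (2.2) by hand and compares with what lm reports (mydata, x and y as in the R box).

    ## Sketch: residuals and the variance estimate (2.2) computed by hand.
    mylm <- lm(x ~ y, data = mydata)
    res  <- residuals(mylm)                            # x_i - g-hat(y_i)
    d    <- length(coef(mylm)) - 1                     # number of f_k terms (here d = 1)
    sigma2_hat <- sum(res^2) / (nrow(mydata) - d - 1)  # formula (2.2)
    all.equal(sqrt(sigma2_hat), summary(mylm)$sigma)   # agrees with lm's residual standard error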
To have any faith in the fitted model, we need to somehow justify that it actually fits the
observed data, e.g. that the functional form of g is appropriate and that the ε-variables
are independent and identically distributed. Moreover, we are sometimes in a position
where we also assume a known distribution for the ε-variables, for instance that the
variables are normally distributed with mean 0 and variance σ². In that case we would
also like to check if such an assumption is fulfilled. Such model control can be carried
out in more or less formal ways by considering and analysing the residuals.
A residual plot is a plot of the residuals against the fitted values. Since
    εi = Xi − g0(yi)
we see that the residuals are approximations of the actual (but unobserved) εi ’s. Thus the
residuals should essentially resemble the behaviour of iid variables. What we look for in the
residual plot are systematic deviations from the iid assumption. If the residuals for instance
are mostly positive for small fitted values and negative for large fitted values, we get an
indication that the fitted g does not capture the means of the Xi variables sufficiently
well. Other deviations can indicate dependence between the ε-variables. Another thing to
look for is whether the residuals are more spread out for e.g. large fitted values than for
small fitted values. This indicates that the variance is not constant but actually depends
upon the covariate.

Figure 2.2: A residual plot corresponding to Example 2.1.1, but where the functional form
of g when adding both enzymes is g(y, z) = exp(−0.1z)y + exp(−0.1y)z. Thus the enzymes
cancel out each other’s effect. The result is clear from the residual plot: too many negative
residuals for large and small fitted values of g, and too many positive in between.
If we also assume that εi is normally distributed, we can check this assumption by com-
paring the empirical distribution of the residuals with the normal distribution, e.g. by a
QQ-plot.
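In R this check is a two-liner (mylm as in the R box above):

    ## Sketch: normal QQ-plot of the residuals to check the normality assumption.
    qqnorm(residuals(mylm))
    qqline(residuals(mylm))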
The estimators β̂0, β̂1, . . . , β̂d, and σ̂² are all transformations of the observable random
variables X1, . . . , Xn. The distribution of the estimators, that is, the distribution of the trans-
formation of X1, . . . , Xn given by the estimators, tells us something about how good the
distribution of the εi ’s. It is in general not possible to find the distribution of the estima-
tors, but if εi ∼ N (0, σ 2 ) then it is actually possible.
If εi ∼ N(0, σ²) it holds that
• the d + 1 dimensional vector (β̂0, β̂1, . . . , β̂d) follows a (multivariate) normal distribution, and
• the vector (β̂0, β̂1, . . . , β̂d) and σ̂² are independent.
Figure 2.3: Continuing Example 2.1.2 we see the corresponding residual plot (left) for the
dataset that was generated using the linear regression model. The residual plot for the
dataset not from the model (right) indicates that the fit is not good.
We will not introduce the formulas for the parameters entering the normal distribution of
(β̂0, β̂1, . . . , β̂d) here, but just remark that the marginal distribution of β̂k is N(βk0, γk²σ²)
where γk² is a function of the (known) fk(yik)’s.
These distributional results only hold if ε ∼ N(0, σ²), but one of the more remarkable
results from probability theory gives that the results are approximately true if n is large
no matter what the distribution of the εi’s is – as long as it has a finite variance² and
they are independent. This result allows us to use the distributions above in general to
assess the uncertainty of the estimated parameters provided that n is sufficiently large –
whatever that means.
If, for instance, the covariate yk from the covariate vector y = (y1, . . . , yd) has no effect
on the mean value of the X’s, then the true value of βk is βk0 = 0. This means that the
chance of getting a value of the estimate with |β̂k| > z can be computed from the normal
distribution (using symmetry of the normal distribution) by
    P(|β̂k| > z) = 2P(β̂k > z) = (2/(√(2π) γk σ)) ∫_z^∞ exp(−x²/(2γk²σ²)) dx.
For practical purposes we approximate the unknown standard error σ by the estimated
value σ̂ = √σ̂². As a rule of thumb, if βk0 = 0 there is approximately a 32% chance that
² If Vε = ∞ one enters a completely different ball game.
|β̂k| > γk σ̂, approximately a 5% chance that |β̂k| > 2γk σ̂, and approximately a 0.3% chance
that |β̂k| > 3γk σ̂. Accurate computation of the probabilities is, however, easily based on
the distribution function of the normal distribution.
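The rule-of-thumb probabilities come directly from the normal distribution function, as the short R sketch below shows; here se_k plays the role of γkσ̂ and would in practice be read off the standard error column of summary(mylm)$coefficients.

    ## Sketch: normal tail probabilities P(|beta_k-hat| > z) when the true beta_k0 = 0.
    p_exceed <- function(z, se_k) 2 * pnorm(z, mean = 0, sd = se_k, lower.tail = FALSE)
    p_exceed(1, se_k = 1)   # ~0.32, i.e. exceeding gamma_k * sigma-hat
    p_exceed(2, se_k = 1)   # ~0.046
    p_exceed(3, se_k = 1)   # ~0.003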
If ε ∼ N(0, σ²) even more is true, because then the distribution of

    (β̂k − βk0)/(γk σ̂)

is known. It is a T-distribution with n − d − 1 degrees of freedom. One would therefore prefer
using the T -distribution instead of the normal distribution to compute e.g. the probability
P(|β̂k| > γk σ̂) given that βk0 = 0. The feature of the T-distribution that distinguishes it
from the normal distribution is that it is more spread out for small n. With n − d − 1 = 5
the probability P(|β̂k| > 2γk σ̂) evaluated using the T-distribution is close to 10% instead
of 5%, but with n − d − 1 = 50 the probability has dropped to about 5.1% for the
T-distribution.
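The comparison between the T-distribution and the normal distribution can be verified directly in R:

    ## Sketch: the same two-sided tail probability under the T- and the normal distribution.
    2 * pt(2, df = 5, lower.tail = FALSE)    # ~0.10  (n - d - 1 = 5)
    2 * pt(2, df = 50, lower.tail = FALSE)   # ~0.051 (n - d - 1 = 50)
    2 * pnorm(2, lower.tail = FALSE)         # ~0.046 (normal approximation)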
There are, however, three provisos before drawing any serious conclusions based on the
marginal distributional results above. The first one is that if the model is wrong, then the
results need not hold, and the conclusions drawn may be meaningless or downright wrong.
We may conclude that βk = 0 and thus that the covariate yk does not affect the mean, but
if the model is wrong and the covariate actually affects the mean in a non-additive way for
instance, the conclusion is meaningless and based upon wrong initial assumptions. There-
fore it is compulsory to check that the model assumptions are fulfilled – to the extent it is
possible of course. Second, if the εi’s are not iid normally distributed,
the distribution of the estimators is only approximate, and if the approximation is bad, the
conclusions may again be wrong. The approximations get better with more observations,
but it is difficult to give general guidelines for how many observations we need. Finally,
the considerations presented above are all marginal. This means that we consider only the
distribution of each of the β̂k estimators one at a time.