Alignment of Sequences
V A T C T G A T G
W T G C A T A C
Indels: insertion or deletion
Simple scoring
When mismatches are penalized by –μ, indels
are penalized by –σ,
and matches are rewarded with +1,
the resulting score is:
score = #matches - μ·#mismatches - σ·#indels
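The simple scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `simple_score` and the sample alignment of V and W are assumptions made for this example (the alignment is just one possible gapped arrangement of the two sequences above).

```python
def simple_score(v_aligned, w_aligned, mu, sigma):
    """Score a gapped alignment: +1 per match, -mu per mismatch, -sigma per indel."""
    if len(v_aligned) != len(w_aligned):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(v_aligned, w_aligned):
        if a == "-" or b == "-":
            score -= sigma   # column with a gap: indel
        elif a == b:
            score += 1       # match
        else:
            score -= mu      # mismatch
    return score


# One hypothetical alignment of V = ATCTGATG and W = TGCATAC:
# 3 matches, 3 mismatches, 3 indels, so 3 - 1*3 - 2*3 = -6.
print(simple_score("AT-CTGATG", "-TGCATA-C", mu=1, sigma=2))  # -6
```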
       Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
         A   R   N   D   C   Q   E   G   H   I   L   K ...
Ala A   13   6   9   9   5   8   9  12   6   8   6   7 ...
Arg R    3  17   4   3   2   5   3   2   6   3   2   9
Asn N    4   4   6   7   2   5   6   4   6   3   2   5
Asp D    5   4   8  11   1   7  10   5   6   3   2   5
Cys C    2   1   1   1  52   1   1   2   2   2   1   1
Gln Q    3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W    0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y    1   1   2   1   3   1   1   1   3   2   2   1
Val V    7   4   4   4   4   4   4   4   5   4  15  10
BLOSUM
Blocks Substitution Matrix
Scores derived from observations of the
frequencies of substitutions in blocks of local
alignments in related proteins
Matrix name indicates evolutionary distance
– BLOSUM62 was created using sequences
sharing no more than 62% identity
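The slide does not give the formula that turns substitution frequencies into scores. A common form, assumed here, is a rounded log-odds ratio in half-bit units: the observed frequency of a residue pair divided by the frequency expected if the two residues were paired by chance. The function name and the toy frequencies below are hypothetical and are not taken from any real BLOSUM data.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    scale=2.0 gives half-bit units (an assumption about the scaling used).
    """
    expected = freq_a * freq_b  # chance of pairing a with b if independent
    return round(scale * math.log2(pair_freq / expected))


# Toy numbers only: a pair seen 8x more often than chance scores positive,
# a pair seen 4x less often than chance scores negative.
print(log_odds_score(0.02, 0.05, 0.05))   # 6
print(log_odds_score(0.001, 0.05, 0.08))  # -4
```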
BLOSUM versus PAM
The PAM family
– PAM matrices are based on global alignments of closely related proteins.
– PAM1 is the matrix calculated from comparisons of sequences with no
more than 1% divergence; other PAM matrices are extrapolated from
PAM1.
The BLOSUM family
– BLOSUM matrices are based on local alignments (blocks)
– All BLOSUM matrices are based on observed alignments (BLOSUM62 is a
matrix calculated from comparisons of sequences with no more than 62%
identity)
Higher numbers in the PAM matrix naming scheme
denote larger evolutionary distance; BLOSUM is the
opposite.
– For alignment of distant proteins, you use PAM150 instead of PAM100, or
BLOSUM50 instead of BLOSUM62.
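The extrapolation of other PAM matrices from PAM1 can be sketched as repeated matrix multiplication of the PAM1 mutation-probability matrix with itself (PAMn as the n-th power of PAM1). The sketch below uses a made-up two-letter alphabet so that it stays readable; the 2x2 matrix and the helper names `mat_mul` and `pam_n` are assumptions for illustration, not real PAM data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate a PAM-n mutation-probability matrix as the n-th power of PAM1."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1: each letter is conserved with probability 0.99 per PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
pam100 = pam_n(toy_pam1, 100)
pam250 = pam_n(toy_pam1, 250)
# Conservation (the diagonal) shrinks as the PAM number grows.
print(pam100[0][0], pam250[0][0])
```

The shrinking diagonal mirrors the naming rule above: a larger PAM number models a larger evolutionary distance.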
Local vs. global alignment
• Global Alignment
--T---CC-C-AGT---TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
Compute a “mini” global alignment to get a local alignment.
[Figure labels: Local alignment, Global alignment]
Local alignment: running time
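• Long run time, O(n^6), if every pair of substrings is aligned globally: there are O(n^4) pairs of substrings and each global alignment takes O(n^2).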
• The recurrence:
s_{i,j} = max { 0,
                s_{i-1,j-1} + δ(v_i, w_j),
                s_{i-1,j} + δ(v_i, -),
                s_{i,j-1} + δ(-, w_j) }
Notice that the 0 term is the only change from the original recurrence of a global alignment.
The local alignment recurrence
• The largest value of si,j over the whole edit graph
is the score of the best local alignment.
• The recurrence:
s_{i,j} = max { 0,
                s_{i-1,j-1} + δ(v_i, w_j),
                s_{i-1,j} + δ(v_i, -),
                s_{i,j-1} + δ(-, w_j) }
Power of ZERO: the 0 term is the only change from the original recurrence of a global alignment, since there is only one “free ride” edge entering into every vertex.
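• Complexity: O(N^2), or O(MN) for sequences of lengths M and N
• Initialization will be different: the first row and column are set to 0 instead of accumulating indel penalties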
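The recurrence above can be sketched as a short dynamic-programming routine. This is a minimal illustration under stated assumptions: δ is taken to be the simple scoring scheme (+1 for a match, -μ for a mismatch, -σ for an indel), the function name `local_alignment_score` is made up for this example, and only the best score is returned, not the alignment itself.

```python
def local_alignment_score(v, w, mu=1, sigma=1):
    """Best local alignment score of v and w under simple scoring.

    s[i][j] = max(0, diagonal + delta, up - sigma, left - sigma);
    the answer is the largest s[i][j] over the whole table.
    """
    n, m = len(v), len(w)
    # Row 0 and column 0 stay at 0: the local-alignment initialization.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(
                0,                        # the "free ride" from the source
                s[i - 1][j - 1] + delta,  # match or mismatch
                s[i - 1][j] - sigma,      # v_i aligned to a gap
                s[i][j - 1] - sigma,      # w_j aligned to a gap
            )
            best = max(best, s[i][j])
    return best


# The shared segment GATTAC (6 matches) is the best local alignment here.
print(local_alignment_score("AAAGATTACAAA", "CCCGATTACCC"))  # 6
```

The two nested loops fill an (n+1) x (m+1) table once, which is where the O(MN) running time comes from.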
Scoring indels: naive approach
A fixed penalty σ is given to every indel:
– -σ for 1 indel,
– -2σ for 2 consecutive indels
– -3σ for 3 consecutive indels, etc.
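As a small sketch of the naive scheme, the function below charges each run of consecutive indels in one row of an alignment: a run of length k costs -kσ, so the penalty grows linearly with the length of the run. The function name and the sample row are hypothetical.

```python
from itertools import groupby


def naive_gap_penalties(aligned_row, sigma):
    """Return the penalty charged to each run of consecutive gaps in one row."""
    return [
        -sigma * len(list(run))
        for is_gap, run in groupby(aligned_row, key=lambda c: c == "-")
        if is_gap
    ]


# A run of 3 gaps costs -3*sigma; a single gap costs -sigma.
print(naive_gap_penalties("AT---CG-A", sigma=2))  # [-6, -2]
```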
Match or mismatch
End with deletion
End with insertion
Readings
Chapter 4, 4.1
– “Alignment can reveal homology between
sequences” (similarity vs homology)
– “It is easier to detect homology when comparing
protein sequences than when comparing nucleic
acid sequences”
Primer article: “What is dynamic programming?” by Eddy
We will continue on
Significance of an alignment (score)
– Homologous or not?