Pairwise Sequence Alignment
Module 17
Pairwise alignment
The simplest approach is to line up the sequences against each other, inserting
additional characters (gaps) to bring the two strings into vertical alignment
(Figure 1), and then to score the alignment position by position using a
predefined scoring function. The alignment with the best score is considered the
best alignment. For example, let us set the benefit of a match to +1, the cost of
a mismatch to -3, and the cost of aligning a character to a gap to -4. Figure 1
shows two possible alignments of two sequences. In the first potential alignment
there are 8 matches (8 x +1 = +8), one mismatch (1 x -3 = -3), and one site
aligned with a gap (1 x -4 = -4), for a total score of +1. In the second potential
alignment there are 5 matches (5 x +1 = +5), 4 mismatches (4 x -3 = -12), and one
site aligned with a gap (1 x -4 = -4), for a total score of -11.
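The scoring rule described above can be sketched in a few lines of Python. The
function name and the two pairs of aligned strings are hypothetical (the
sequences of Figure 1 are not reproduced here); they were chosen only so that
the match, mismatch and gap counts agree with the worked example.

```python
def score_alignment(top, bottom, match=1, mismatch=-3, gap=-4, gap_char="-"):
    """Score two already-aligned strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            total += gap        # a character aligned with a gap
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


# Hypothetical alignments: 8 matches, 1 mismatch, 1 gap ...
first = score_alignment("ACGTACGTAC", "ACGTTCGT-C")
# ... and 5 matches, 4 mismatches, 1 gap.
second = score_alignment("ACGTACGTAC", "ACGTAGCAT-")
print(first, second)  # 1 -11
```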
In this elementary example the score increases as more identical residues are
aligned. Conversely, the quality of an alignment can be measured in terms of the
number of gaps introduced and the number of mismatches remaining in the
alignment. A metric built from such parameters represents the distance between
two sequences; that is, the number of gaps and mismatches will be smaller if the
two sequences being compared are evolutionarily closely related.
character) aligned with letters. Figure 2 shows an alignment between the
homologous trypsin proteins from Mus musculus (house mouse) and Astacus
astacus (broad-fingered crayfish), from which it can be calculated that these two
sequences have 41% identity.
The overall goal of pairwise alignment is to find the best pairing of the
sequences, such that there is maximum correspondence among residues. To achieve
this goal, one sequence needs to be shifted relative to the other to find the
position where the maximum number of matches is found. It is impractical to
evaluate all possible alignments. Take the simple case where a sequence of 100
characters is being aligned with a sequence of 95 characters. If we add 5 gaps to
the second sequence to bring it to 100 total sites, there are approximately 55
million possible alignments. Because we may need to add gaps to both sequences,
the actual number of possible alignments is significantly greater.
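The text does not say how the figure of 55 million was counted. One reading that
reproduces it, offered here only as an assumption, is that the 5 gaps are single
gaps placed at 5 distinct positions among the 94 junctions between adjacent
characters of the 95-character sequence. If gaps may also be adjacent or sit at
the ends, the count is the number of ways to choose which 5 of the 100 final
columns hold a gap, which is larger still.

```python
import math

# Assumption: 5 single gaps at distinct interior junctions (94 of them).
interior_only = math.comb(94, 5)

# Assumption: any 5 of the 100 final columns may be gap columns.
any_columns = math.comb(100, 5)

print(interior_only)  # 54891018
print(any_columns)    # 75287520
```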
Scoring an alignment
A scoring scheme can measure either similarity or difference; the best score is
a maximum in a similarity scheme and a minimum in a difference scheme. The choice
of a scoring function that reflects biological or statistical observations about
known sequences is important for producing good alignments. Protein sequences are
frequently aligned using substitution matrices that reflect the probabilities of
given character-to-character substitutions. A series of matrices called PAM
matrices (Point Accepted Mutation matrices) explicitly encode evolutionary
approximations regarding the rates and probabilities of particular amino acid
mutations. Another common series of scoring matrices, known as BLOSUM (Blocks
Substitution Matrix), encodes empirically derived substitution probabilities.
Variants of both types of matrices are used to detect sequences with differing
levels of divergence.
Dot matrix analysis
Dot matrix analysis, first described by Gibbs and McIntyre (1970), is primarily a
method for comparing two sequences to look for possible alignments of characters
between them. It can readily reveal the presence of insertions/deletions and of
direct and inverted repeats in protein and DNA sequences, which are more
difficult to find by the other, more automated methods. This method can also be
used for predicting regions in RNA that are self-complementary and therefore have
the potential to form secondary structure. The major advantage of this method for
finding sequence alignments is that all possible matches of residues between two
sequences are found, leaving the investigator the choice of identifying the most
significant ones.
Figure 3: Illustration of the construction of the dot plot matrix, using a
simple residue identity matrix to score an ‘X’ wherever a pair of identical
residues is observed (Adapted from Attwood & Parry-Smith, Pearson Education)
The second character in B is then compared to the entire A sequence, and a dot
is placed in row 2 wherever a match occurs. This process is continued until the
page is filled with dots representing all the possible matches of A characters
with B characters. Any region of similar sequence is revealed by a diagonal row
of dots. Isolated dots that are not on a diagonal represent random matches that
are probably not related to any significant alignment.
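The row-by-row construction described above can be sketched as follows. This is
a minimal illustration using simple residue identity, with sequence A across the
columns and sequence B down the rows; the function names and the example
sequence are hypothetical.

```python
def dot_matrix(seq_a, seq_b):
    """Row i holds the comparison of seq_b[i] against every character of seq_a."""
    return [[x == y for x in seq_a] for y in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text, marking each match with an X."""
    lines = ["  " + seq_a]
    for residue, row in zip(seq_b, matrix):
        lines.append(residue + " " + "".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


# Comparing a sequence with itself gives an unbroken main diagonal;
# the off-diagonal X marks are the isolated (random) matches.
matrix = dot_matrix("GATTACA", "GATTACA")
print(render(matrix, "GATTACA", "GATTACA"))
```

Real dot plot programs usually reduce the noise by scoring a sliding window of
several residues instead of single characters; that refinement is left out here.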
The plot is characterized by some apparently random dots (noise) and a central
diagonal line, where a high density of adjacent dots (the signal) indicates the
regions of greatest similarity between the two sequences.
Figure 4: Graphical representation of dot plots, showing (a) two identical
sequences, (b) two highly similar sequences, and (c) two different but related
sequences (Adapted from Attwood & Parry-Smith, Pearson Education)
A dot matrix analysis can also reveal the presence of repeats of the same
sequence character many times: these repeats become apparent on the dot matrix
of a protein sequence against itself as horizontal or vertical rows of dots that
sometimes merge into rectangular or square patterns. Dot matrices can also be
used to find direct and inverted repeats within sequences. Repeated regions in
whole chromosomes can also be detected by a dot matrix analysis with the help of
interactive web-based computer programs.
Dynamic programming
This method compares every pair of characters in the two sequences and
generates an alignment. The alignment will include matched and mismatched
characters and gaps in the two sequences, positioned so that the number of
matches between identical or related characters is the maximum possible. For a
given scoring scheme, dynamic programming guarantees an optimal alignment and
provides important information for making functional, structural and
evolutionary predictions on the basis of sequence alignment. Both global and
local types of alignment are made by simple changes in the basic dynamic
programming algorithm.
An alignment that essentially spans the full extent of the input sequences is
called a global alignment. In global alignment, the two sequences to be aligned
are assumed to be generally similar over their entire length, and the method
tries to align all of the sites optimally. Alignment is carried out from the
beginning to the end of both sequences to find the best possible alignment
across the entire length of two sequences of roughly the same length.
inversion of section CDE may change the sequence in another species to
ABEDCF.
Although each section of the first sequence is homologous with a section of the
second sequence, the two cannot be globally aligned because of the
rearrangement, and a global alignment fails to recognize the highly similar
local regions between the two sequences. Figure 6 shows possible global
alignments if section C, D, or E is aligned. In each case, the other two
sections cannot be aligned properly.
Local alignment
In a local alignment, subsections of the sequences are aligned without reference
to global patterns of similarity over the entire length. Local alignment allows
the alignment program to align regions separately, regardless of their overall
order within the sequence. Figure 8 illustrates global and local alignment of
two sequences that are not sufficiently similar overall but share common regions.
Figure 8: Illustration of global and local alignments of two sequences that are
not sufficiently similar.
Local alignment helps to align similar regions while allowing highly divergent
regions to remain unaligned. It finds only the local regions with the highest
level of similarity between the two sequences and aligns these regions without
regard for the alignment of the rest of the sequence. This approach can be used
for aligning more divergent sequences with the goal of searching for conserved
patterns in DNA or protein sequences, and for aligning divergent biological
sequences in which only certain modules, such as domains or motifs, are similar.
The rationale for local similarity search is that functional sites, such as the
catalytic sites of enzymes, are localized to relatively short regions, which are
conserved irrespective of deletions or mutations in intervening parts of the
sequence. Some of the common publicly available pairwise comparison programs,
such as BLAST and FASTA, use a local alignment strategy and are made to run even
faster by incorporating heuristics.
the barrier does not permit gap insertion, a lower-scoring pathway may have to be
taken.
The Needleman & Wunsch algorithm produces an alignment that takes into account
all residues of the input sequences. The starting point of the maximum-path
traceback is always at the N-termini, and it is calculated from a scoring process
that commences from the C-termini; for this reason the method results in a
global alignment.
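A compact sketch of global alignment by dynamic programming is given below. It
is not the formulation described above: as an assumption for brevity it uses the
commonly taught variant that fills the matrix from the start of the sequences
and traces back from the far corner, which yields the same optimal score under a
fixed per-character gap penalty. The default scores (+1, -3, -4) are borrowed
from the scoring example earlier in this module, and the function name is
hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Leading gaps are charged, so every residue takes part in the alignment.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # pair a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # a[i-1] against a gap
                          F[i][j - 1] + gap)     # b[j-1] against a gap

    # Trace back from the corner cell to the origin.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (-1, 'ACGT', 'A-GT')
```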
Consider two sequences that are only distantly related to each other: they may
still exhibit small regions of local similarity even though no satisfactory
overall alignment can be found. Smith and Waterman (1981) described a method for
finding these common regions of similarity. As in Needleman & Wunsch, this
algorithm employs a matrix-based approach, and backtracking is used to
reconstruct the gapped alignments. A key feature of this algorithm is that each
cell in the matrix defines the end point of a potential alignment, whose
similarity is represented by the value stored in the cell. The algorithm begins
by filling the edge elements of a 2D matrix with 0.0 values, because these cells
represent the ends of alignments of length zero and, consequently, their
similarity score is zero. Cells in the matrix hold floating-point values rather
than integers. The next step is to populate the remaining cells. This is
achieved by evaluating three functions and choosing the maximum of the three
values, or zero if a negative value would result; these functions consider the
possibilities for ending an alignment at any particular cell. First, the
similarity score for the pair of residues at the cell under consideration (e.g.
1.0 for a match, -0.333 for a mismatch) is added to the score of the cell's
diagonal predecessor. Then the maximum value is calculated for an alignment
ending in a deletion (a) along the current row of the matrix and (b) along the
current column of the matrix. Finally, if a negative score would result, 0.0 is
substituted, to indicate that there is no alignment similarity up to the current
cell position. Once the matrix is complete, the highest score is located; it
represents the end point of the highest-scoring alignment between the two
sequences, and the other elements leading to it are determined using a
backtracking procedure.
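The cell-filling and backtracking steps above can be sketched as follows. The
match and mismatch scores are the ones quoted in the text; the gap penalty (1.0
plus one third per gap character) is an assumption added for illustration, as
are the function and parameter names and the test sequences.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1.0 / 3.0,
                   gap_open=1.0, gap_extend=1.0 / 3.0):
    """Return (best_score, segment_a, segment_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # Edge cells stay at 0.0: they end alignments of length zero.
    H = [[0.0] * cols for _ in range(rows)]
    # back[i][j] is the predecessor cell, or None where an alignment starts.
    back = [[None] * cols for _ in range(rows)]
    best, best_cell = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Option 1: extend the diagonal predecessor with this pair.
            score, prev = H[i - 1][j - 1] + s, (i - 1, j - 1)
            # Option 2: end in a gap of length k running along the column.
            for k in range(1, i + 1):
                cand = H[i - k][j] - (gap_open + gap_extend * k)
                if cand > score:
                    score, prev = cand, (i - k, j)
            # Option 3: end in a gap of length k running along the row.
            for k in range(1, j + 1):
                cand = H[i][j - k] - (gap_open + gap_extend * k)
                if cand > score:
                    score, prev = cand, (i, j - k)
            # Negative (or zero) results are replaced by 0.0: no similarity here.
            if score > 0.0:
                H[i][j], back[i][j] = score, prev
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Backtrack from the highest-scoring cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_cell
    while back[i][j] is not None:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
        elif pj == j:                       # gap in b
            for x in range(i, pi, -1):
                out_a.append(a[x - 1]); out_b.append("-")
        else:                               # gap in a
            for y in range(j, pj, -1):
                out_a.append("-"); out_b.append(b[y - 1])
        i, j = pi, pj
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The only region the two hypothetical sequences share is ACGT.
print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))  # (4.0, 'ACGT', 'ACGT')
```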
The essential difference between the two algorithms is that in Smith-Waterman
the matrix contains a maximum value that may not be at the N-termini of the
sequences. It represents the end point of an alignment such that no other pair
of segments with greater similarity exists between the two sequences; hence the
method is local rather than global.
Word or k-tuple methods, as used by the programs FASTA and BLAST
Word methods, also known as k-tuple methods, are heuristic methods that are
not guaranteed to find an optimal alignment solution, but are significantly more
efficient and faster than dynamic programming. These methods are especially
useful in large-scale database searches where it is understood that a large
proportion of the candidate sequences will have essentially no significant match
with the query sequence. Word methods identify a series of short,
non-overlapping subsequences ("words") in the query sequence that are then
matched to candidate database sequences.
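The word-matching step can be sketched as below. This is only an illustration of
the idea, not the procedure used by FASTA or BLAST: the query is cut into
non-overlapping words of length k as described above, the candidate sequence is
scanned at every position, and the function names and sequences are
hypothetical.

```python
from collections import defaultdict


def query_words(query, k):
    """Map each non-overlapping k-letter word of the query to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(query) - k + 1, k):
        index[query[start:start + k]].append(start)
    return index


def word_hits(query, subject, k):
    """Return (query_position, subject_position) for every word match."""
    index = query_words(query, k)
    hits = []
    for pos in range(len(subject) - k + 1):
        # index.get avoids creating empty entries for unseen words.
        for qpos in index.get(subject[pos:pos + k], ()):
            hits.append((qpos, pos))
    return hits


print(word_hits("ACGTTGCA", "GGACGTCCTGCA", k=4))  # [(0, 2), (4, 8)]
```

A candidate with no hits can be discarded without running a full alignment,
which is where the speed advantage over dynamic programming comes from.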
The first widely used program to perform optimized database similarity searches
for local alignments using a substitution matrix was FASTA. Because applying
this strategy exhaustively would take a substantial amount of time, FASTA
improves speed by using the observed pattern of word hits to identify potential
matches before attempting the more time-consuming optimized search. In this
method, the user defines a value k to use as the word length with which to
search the database. The most promising potential matches are then further
evaluated by searching for a gapped local alignment, for which a full
Smith-Waterman alignment is performed. The method is slower but more sensitive
at lower values of k, which are also preferred for searches involving a very
short query sequence.
BLAST
Interpretation of results
The evolution of a new gene is often thought to occur by gene duplication, which
creates two tandem copies of the gene, followed by mutations in these copies. In
rare cases, new mutations in one of the copies provide an advantageous change in
function. In due course the two copies may evolve along separate pathways.
Although the resulting separation of function will generate two related sequence
families, sequences in both families will still be similar because they share a
single ancestral gene. Sequence similarity searches also help in identifying
several possible types of ancestor relationships, such as orthologs, paralogs
and analogs. Interpretation of alignments can also help to sort out possible
evolutionary origins among similar sequences.
Any standard alignment program will produce some statistical value indicating
the probability that an alignment of a given quality could arise by chance, but
this value does not indicate how much superior a given alignment is to
alternative alignments of the same sequences. The statistics quoted in BLAST for
pairwise comparisons are probability (p) or expected frequency (E) values. The
p-value relates the score returned for an alignment to the likelihood of its
having arisen by chance: the closer the value approaches zero, the greater the
confidence that the match is real; conversely, the nearer the value is to unity,
the greater the chance that the match is spurious.
The E-value describes the number of hits one can expect to see by chance (in
other words, noise) when searching a database of a particular size. For example,
an E-value of 1 assigned to a hit can be interpreted as meaning that, in the
current search, one might expect to see one match with a similar score simply by
chance; conversely, a value of zero indicates that no matches would be expected
by chance.
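The text does not give a formula linking the two statistics. A relation commonly
used for BLAST-style statistics, under the assumption that chance hits follow a
Poisson distribution, is P = 1 - e^(-E): the probability of seeing at least one
chance hit with a score this good. The sketch below, with a hypothetical
function name, shows that the two values are nearly equal when E is small and
that P approaches unity as E grows.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit), assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```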