Pairwise Sequence Alignment

Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid).

Uploaded by

vanigo1824
Copyright
© © All Rights Reserved

FUNDAMENTALS OF BIOINFORMATICS

Module 17: Pairwise Sequence Alignment

The objectives of this module are:

• To describe the basic concepts of sequence alignment.
• To describe the concept of pairwise alignment and the algorithms and
  methods of sequence alignment.

Basic concepts of sequence alignment

Sequence comparison is an important method for analysing the structure and
function of biological macromolecules such as proteins, DNA and RNA. Mutation
and selection during evolution over millions of years can result in considerable
divergence between present-day sequences derived from the same ancestral
gene. Substitutions, insertions and deletions of bases at corresponding
positions of the ancestral sequence occur during evolution as a result of
mutations. These changes cause sequences to diverge from the common ancestor
and can also change the sequence length. As a result of mutation, even the
sequences of the same protein or gene from two closely related species are
rarely identical. Events such as the fusion of sequences from two different
genes and gene duplication also occur in genomes. It is presumed that
present-day homologous sequences have diverged from a common ancestral sequence
through successive molecular changes. Sequence alignment enables the researcher
to determine whether two sequences exhibit sufficient similarity to support the
conclusion that the two aligned genes share a common evolutionary history and
are homologous. An alignment indicates the positions of insertions and deletions
and shows which parts of a sequence are variable and which are conserved.

An alignment helps us to study the following:

• Comparison of nucleic acid and protein sequences to determine the functional,
  structural and evolutionary relationships between them.
• The degree of sequence conservation in an alignment, which reveals the
  evolutionary relatedness of different sequences.
• Identification and quantification of conserved regions or functional motifs.
• Identification of pseudogenes.
• Profiling of genetic disease.
• Phylogenetic analysis, ancestral sequence profiling and prediction.
• Comparison of a query sequence with the sequences present in a database
  (database search).

Information generated from sequence alignment can also help in assigning
functions to unknown proteins, determining the evolutionary relationships of
organisms, predicting the 3D structure of a protein, profiling homopolymeric
stretches, identifying genetic disease and profiling ancestral sequences.

Pairwise alignment

Among the different comparison approaches, sequence alignment is the most
fundamental. It is the procedure of comparing two sequences (pairwise alignment)
in which one attempts to infer which positions within the sequences are
homologous, that is, which sites share a common evolutionary history. The
sequences are compared with one another in a search for common character
patterns, and residue-to-residue correspondences are established between
related sequences while preserving the order of the residues within each
sequence. Comparing one sequence with another is termed pairwise alignment, and
the degree of similarity between the two is quantified using a scoring function.

The simplest approach is to line up the sequences against each other, insert
additional characters (gaps) to bring the two strings into vertical alignment
(Figure 1), and score the alignment position by position using a predefined
scoring function. The alignment with the best score is considered the best
alignment. For example, let us set the benefit of a match to +1, the cost of a
mismatch to -3, and the cost of aligning a character with a gap to -4. Figure 1
shows two possible alignments of two sequences. In the first potential alignment
there are 8 matches (8 x +1 = +8), one mismatch (1 x -3 = -3), and one site
aligned with a gap (1 x -4 = -4), for a total score of +1. In the second
potential alignment there are 5 matches (+5), 4 mismatches (-12) and one site
aligned with a gap (-4), for a total score of -11.

Figure 1: A simple pairwise alignment of sequences. The alignment on the left
is better than the alignment on the right because its overall score is larger
(+1 vs. -11).
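
The scoring rule described above can be sketched in a few lines of Python. This
is only an illustration: the function name score_alignment and the two aligned
rows are hypothetical (they are not the sequences of Figure 1), and a gap is
assumed to be written as "-".

```python
def score_alignment(row_a, row_b, match=1, mismatch=-3, gap=-4):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a character aligned with a gap
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


# Hypothetical rows with 8 matches, 1 mismatch and 1 gap
row_a = "ACGTACGTAC"
row_b = "ACGTTCGT-C"
print(score_alignment(row_a, row_b))
```

With the +1 / -3 / -4 scheme, these rows give 8 - 3 - 4 = +1, the same
arithmetic as the first alignment discussed in the text.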

In this elementary example the score increases as more identical residues are
aligned. Conversely, the quality of an alignment can be measured in terms of the
number of gaps introduced and the number of mismatches remaining in the
alignment. A metric based on these parameters represents the distance between
two sequences: the numbers of gaps and mismatches will be smaller if the two
sequences being compared are closely related in evolutionary terms.

In the ideal case, in which a sequence alignment genuinely reflects the
evolutionary history of two genes or proteins, residues that have been aligned
but are not identical represent substitutions. Regions where the residues of
one sequence correspond to nothing in the other are interpreted as either an
insertion into one sequence or a deletion from the other. These gaps are usually
represented in the alignment as consecutive dashes (or another punctuation
character) aligned with letters. Figure 2 shows an alignment between the
homologous trypsin proteins from Mus musculus (house mouse) and Astacus astacus
(broad-fingered crayfish), from which it can be calculated that these two
sequences have 41% identity.

Figure 2: Conserved positions are often of functional importance. Alignment of
the trypsin proteins of mouse (SWISS-PROT P07146) and crayfish (SWISS-PROT
P00765). Identical residues are underlined. Indicated above the alignment are
three disulfide bonds (-S-S-), with the participating cysteine residues
conserved, the amino acid side chains involved in the charge relay system
(asterisk), and the active-site residue governing substrate specificity
(diamond).

In a residue-by-residue alignment, it is often apparent that certain regions of
a protein, or perhaps specific amino acids, are more highly conserved than
others. This information may suggest which residues are most crucial for
maintaining a protein's structure or function. In the trypsin alignment of
Figure 2, the active-site residues that determine substrate specificity and
provide the key residues of serine proteases correspond to conserved positions,
as do the cysteine residues that form several disulfide bonds important for
maintaining the enzyme's structure. On the other hand, there may be other
positions that do not play a significant functional role yet happen to be
identical for historical reasons. Sequence alignments provide a useful way to
gain new insights by deducing the structural and functional properties of a
novel protein from comparisons with those that have been well studied.

The overall goal of pairwise alignment is to find the best pairing of the
sequences, such that there is maximum correspondence among residues. To achieve
this goal, one sequence needs to be shifted relative to the other to find the
position where the maximum number of matches is found. It is impractical to
evaluate all possible alignments one by one. Take the simple case where a
sequence of 100 characters is being aligned with a sequence of 95 characters. If
we add 5 gaps to the second sequence to bring it to 100 total sites, there are
already about 75 million possible alignments (the number of ways to choose which
5 of the 100 sites hold a gap, 100!/(5! x 95!) = 75,287,520). Because we may
need to add gaps to both sequences, the actual number of possible alignments is
significantly greater.

Scoring an alignment

Since two sequences can be aligned in a variety of different ways, the best
possible alignment for any given pair of sequences is found by calculating a
numerical value, or score, for the overall similarity of each possible
alignment so that the alignments can be ranked. The alignment giving the best
score is referred to as the optimal alignment. The best-scoring alignment will
not necessarily be the correct one and, conversely, the correct alignment will
not necessarily have the best score, since no scoring scheme perfectly models
the complex evolutionary process.

A scoring scheme can measure either similarity or difference, the best score
being a maximum in a similarity scheme and a minimum in a difference scheme. The
choice of a scoring function that reflects biological or statistical
observations about known sequences is important for producing good alignments.
Protein sequences are frequently aligned using substitution matrices that
reflect the probabilities of given character-to-character substitutions. A
series of matrices called PAM matrices (Point Accepted Mutation matrices)
explicitly encode evolutionary approximations regarding the rates and
probabilities of particular amino acid mutations. Another common series of
scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes
empirically derived substitution probabilities. Variants of both types of
matrices are used to detect sequences with differing levels of divergence.

The simplest way of quantifying similarity between two sequences is percentage
identity, which describes the degree to which two or more sequences are actually
identical at each position and is measured by counting the number of identical
bases or amino acids matched between the aligned sequences. Percentage identity
is obtained by dividing the number of identical matches by the total length of
the aligned region and multiplying by 100. Since there are only four different
nucleotides in nucleic acid sequences and only 20 different amino acids in
protein sequences, there is always a small but finite probability that identical
residues will be matched at some position in any alignment. Even unrelated
sequences are expected to match at several positions, as there are often
hundreds of residues in a protein sequence and thousands in a nucleotide
sequence. The length of the alignment must therefore also be considered: a 30%
identity over a long alignment is less likely to have arisen by chance than a
30% identity over a very short alignment. The significance of an alignment is
therefore assessed by statistical methods.
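
As a rough sketch of the percentage identity calculation described above, the
following Python function counts identical columns and divides by the length of
the aligned region. The function name percent_identity is hypothetical, gaps are
assumed to be written as "-", and gap columns are assumed to count towards the
aligned length (other conventions exist).

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned columns holding identical residues."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both rows hold the same residue
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)


# Hypothetical aligned rows: 8 identical columns out of 10
print(percent_identity("ACGTACGTAC", "ACGTTCGT-C"))
```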

Methods of sequence alignment

Alignment of two sequences is performed using the following methods:

• Dot matrix analysis
• The dynamic programming algorithm
• Word or k-tuple methods, such as those used by the programs FASTA and BLAST

Dot matrix analysis

Dot matrix analysis, first described by Gibbs and McIntyre (1970), is primarily
a method for comparing two sequences to look for possible alignments of
characters between them. It can readily reveal the presence of insertions and
deletions, and of direct and inverted repeats, in protein and DNA sequences
that are more difficult to find by the other, more automated methods. This
method can also be used for predicting regions in RNA that are
self-complementary and therefore have the potential to form secondary
structure. The major advantage of this method for finding sequence alignments
is that all possible matches of residues between two sequences are found,
leaving the investigator the choice of identifying the most significant ones.

The dot matrix method displays any possible sequence alignments as diagonals on
the matrix and should be used first if the sequences are not known to be very
much alike. In this method, one sequence (A) is listed across the top of a page
and the other sequence (B) is listed down the left side, as illustrated in
Figure 3. Starting with the first character in B, one then moves across the
page, keeping to the first row and placing a dot in any column where the
character in A is the same.

Figure 3: Illustration of the construction of a dot plot matrix, using a simple
residue identity matrix to score an 'X' where a pair of identical residues is
observed (adapted from Attwood & Parry-Smith, Pearson Education)

The second character in B is then compared with the entire A sequence, and a
dot is placed in row 2 wherever a match occurs. This process is continued until
the page is filled with dots representing all the possible matches of A
characters with B characters. Any region of similar sequence is revealed by a
diagonal row of dots. Isolated dots that are not on a diagonal represent random
matches that are probably not related to any significant alignment.
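
The row-by-row construction just described can be sketched in Python as
follows. The function names dot_matrix and render and the two short sequences
are hypothetical; an 'X' marks a dot and '.' marks an empty cell.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b (left side), columns follow seq_a (top)."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, with seq_a across the top."""
    grid = dot_matrix(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, grid):
        cells = " ".join("X" if hit else "." for hit in row)
        lines.append(b + " " + cells)
    return "\n".join(lines)


# Hypothetical sequences differing at one position
print(render("GATTACA", "GATCACA"))
```

In the printed grid the shared residues line up along the main diagonal, with
scattered off-diagonal marks wherever the same letter happens to recur.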

The plot is characterized by some apparently random dots (noise) and a central
diagonal line, where a high density of adjacent dots (the signal) indicates the
regions of greatest similarity between the two sequences.

Figure 4: Graphical representation of dot plots, showing (a) two identical
sequences, (b) two highly similar sequences and (c) two different but related
sequences (adapted from Attwood & Parry-Smith, Pearson Education)

Within a dot plot, two identical sequences are characterized by a single
unbroken diagonal line across the plot, as shown in Figure 4a. Two similar
sequences are characterized by a broken diagonal, the interrupted regions
indicating the locations of sequence mismatches (Figure 4b). A pair of distantly
related sequences with fewer similarities has a much noisier plot. In this case
diagonal clusters of dots are observed parallel to the central diagonal,
separated from it by a distance (Figure 4c) that represents the number of
insertions required to bring the sequences into correct register.

To filter out random matches between the compared sequences, a sliding window
of defined size can be used. Instead of comparing single sequence positions, a
window of adjacent positions in the two sequences is compared at the same time,
and a dot is printed on the page only if a certain minimal number of matches
occurs within the window.
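
A minimal sketch of this window filter is given below. The function name
windowed_dots and its parameters window and min_matches are hypothetical, and
the dot is assumed to be recorded at the start position of each window pair.

```python
def windowed_dots(seq_a, seq_b, window=3, min_matches=3):
    """Return (row, column) positions where a window pair passes the filter."""
    dots = set()
    for i in range(len(seq_b) - window + 1):        # window start in B (rows)
        for j in range(len(seq_a) - window + 1):    # window start in A (columns)
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# Comparing a hypothetical sequence with itself: only the diagonal survives
print(sorted(windowed_dots("GATTACA", "GATTACA")))
```

Lowering min_matches below the window size lets imperfect stretches through,
at the cost of more background dots.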

A dot matrix analysis can also reveal the presence of repeats of the same
sequence character many times. These repeats become apparent on the dot matrix
of a protein sequence against itself as horizontal or vertical rows of dots
that sometimes merge into rectangular or square patterns. Dot matrices can also
be used to find direct and inverted repeats within sequences. Repeated regions
in whole chromosomes can also be detected by a dot matrix analysis with the
help of interactive web-based computer programs.

Dynamic programming

This method compares every pair of characters in the two sequences and
generates an alignment. The alignment will include matched and mismatched
characters and gaps in the two sequences, positioned so that the number of
matches between identical or related characters is the maximum possible.
Dynamic programming guarantees an optimal alignment under the chosen scoring
scheme and provides important information for making functional, structural and
evolutionary predictions on the basis of sequence alignment. Both global and
local alignments can be made by simple changes to the basic dynamic programming
algorithm.

Global alignment and Local alignment

An alignment that essentially spans the full extent of the input sequences is
called a global alignment. In global alignment, the two sequences to be aligned
are assumed to be generally similar over their entire length, and an attempt is
made to align all of the sites within the sequences optimally. Alignment is
carried out from the beginning to the end of both sequences to find the best
possible alignment across the entire length of two sequences of roughly the
same length.

Figure 5: Optimal global sequence alignment. Alignment of the amino acid
sequences of human zeta-crystallin and E. coli quinone oxidoreductase.
Identical residues are marked by asterisks below the alignment and dots
indicate conserved residues.

The trypsin and quinone oxidoreductase/zeta-crystallin alignments shown in
Figure 2 and Figure 5 are examples of global alignments. Proteins consisting of
a single globular domain, and homologous sequences that have not diverged
substantially, can often be aligned using a global strategy.

For divergent sequences that result from large-scale sequence rearrangement
and genome shuffling, only subsections of the sequences may be homologous, and
these may be of variable length or in a different order. For example, a long
sequence may be ordered ABCDEF (where each letter represents a section of
sequence and not an individual site). An inversion of section CDE may change
the sequence in another species to ABEDCF.

Figure 6: Illustration of the global alignment problem. Sequences ABCDEF and
ABEDCF cannot be properly aligned because the homologous sections of the
sequences are not in the same order.

Although each section of the first sequence is homologous with a section of the
second sequence, the two cannot be globally aligned because of the
rearrangement: a global alignment fails to recognize the highly similar local
regions shared by the two sequences. Figure 6 shows the possible global
alignments if section C, D or E is aligned. In each case, the other two sections
cannot be aligned properly.

Local alignment

Global alignment strategies were developed before the exon/intron structure of
genes had been discovered. Many proteins do not display global patterns of
similarity but instead appear to be mosaics of modular domains. Figure 7 shows
the modular structure of two proteins involved in blood clotting: coagulation
factor XII (F12) and tissue-type plasminogen activator (PLAT). Besides the
catalytic domain, which provides the serine protease activity, these proteins
have different numbers of other structural modules: two types of fibronectin
repeats, a domain with similarity to epidermal growth factor, and a module
called a "kringle" domain.

Figure 7: Modular structure of two proteins involved in blood clotting.
Schematic representation of the modular structure of human tissue plasminogen
activator and coagulation factor XII. A module labeled C is shared by several
proteins involved in blood clotting. F1 and F2 are frequently repeated units
that were first seen in fibronectin. E is a module resembling epidermal growth
factor. A module known as a "kringle domain" is denoted K.

These modules can be repeated or appear in different orders. Patterns of
modularity often arise by in-frame exchange of whole exons. Global alignment
methods do not take this phenomenon into account.

In a local alignment, subsections of the sequences are aligned without reference
to global patterns of similarity over the entire length. Local alignment allows
the alignment program to align regions separately, regardless of their overall
order within the sequences. Figure 8 illustrates global and local alignments of
two sequences that are not similar overall but share common regions.

Figure 8: Illustration of global and local alignments of two sequences that are
not sufficiently similar.

Local alignment helps to align similar regions while allowing highly divergent
regions to remain unaligned. It finds only the local regions with the highest
level of similarity between the two sequences and aligns these regions without
regard for the alignment of the rest of the sequences. This approach can be
used for aligning more divergent sequences with the goal of searching for
conserved patterns in DNA or protein sequences, and for aligning divergent
biological sequences in which only certain modules, such as domains or motifs,
are similar. The rationale for local similarity searching is that functional
sites, such as the catalytic sites of enzymes, are localized to relatively
short regions, which are conserved irrespective of deletions or mutations in
the intervening parts of the sequence. Some common publicly available pairwise
comparison programs, such as BLAST and FASTA, use a local alignment strategy
and are made to run even faster by incorporating heuristics.

Needleman & Wunsch and Smith & Waterman algorithms for pairwise alignments

Needleman & Wunsch algorithm

In this approach, a maximum match between two sequences is defined to be the
largest number of amino acids from one protein that can be matched with those of
another protein while allowing for all possible deletions. A penalty is
introduced to provide a barrier to arbitrary gap insertions. In its simplest
form, Needleman & Wunsch (1970) proposed a maximum-match pathway in which cells
representing identity are scored 1 and cells representing mismatch are scored 0.
A 2D array similar to a dot plot is populated with these values, and an
operation of successive summation of cells then commences. This process examines
each cell in the matrix: the maximum score along any path leading to the cell is
added to its present contents, and the summation continues. When the process is
completed, the maximum-match pathway is constructed. An alignment is generated
by working through the matrix, starting at the highest-scoring element at the
N-termini and following the pattern of high scores through to the C-termini.
Leaps to non-adjacent diagonal cells in the matrix indicate the need for a gap
insertion to bring the sequences into register, as long as the gap penalty
barrier permits opening a gap at that point. If the barrier does not permit gap
insertion, a lower-scoring pathway may have to be taken.

The Needleman & Wunsch algorithm produces an alignment that takes into account
all residues of the input sequences. The starting point of the maximum-path
traceback is always at the N-termini and is calculated from a scoring process
that commences from the C-termini. For this reason the method results in a
global alignment.
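
The following Python sketch shows one common way of implementing a global
alignment of this kind. It is not the original 1970 summation procedure: it
assumes the widely used formulation in which the matrix is filled forward from
the start of both sequences and traced back from the final cell (the mirror
image of the direction described above), with a fixed penalty per gap position.
The function name needleman_wunsch is hypothetical, and the default scores
reuse the +1 / -3 / -4 values from the scoring example earlier in this module.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-3, gap=-4):
    """Return (score, aligned seq_a, aligned seq_b) for a global alignment."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps: the first row and column accumulate gap penalties
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Each cell takes the best of: diagonal (pair), up (gap in B), left (gap in A)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the final cell to the origin covers every residue
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(seq_a[i - 1])
                row_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


# Hypothetical sequences; the shorter one needs a single gap
print(needleman_wunsch("ACGT", "AGT"))
```

Because the traceback runs from one corner of the matrix to the other, every
residue of both inputs appears in the result, which is what makes the
alignment global.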

The Smith-Waterman algorithm

Consider two sequences that are only distantly related to each other. Even so,
they may exhibit small regions of local similarity although no satisfactory
overall alignment can be found. Smith and Waterman (1981) described a method for
finding these common regions of similarity. As in Needleman & Wunsch, this
algorithm employs a matrix-based approach, and backtracking is used to
reconstruct the gapped alignments. A key feature of this algorithm is that each
cell in the matrix defines the end point of a potential alignment, whose
similarity is represented by the value stored in the cell. The algorithm begins
by filling the edge elements of a 2D matrix with 0.0 values, because these cells
represent the ends of alignments of length zero and consequently their
similarity score is zero. Cells in the matrix hold floating-point values rather
than integers. The next step is to populate the remaining cells. This is
achieved by evaluating three functions and choosing the maximum of the three
values, or zero if a negative value would result. These functions consider the
possibilities for ending an alignment at any particular cell. First, the
similarity score for the pair of residues at the cell under consideration (e.g.
1.0 for a match, -0.333 for a mismatch) is added to the score of its diagonal
predecessor. Then the maximum value is calculated for a deletion (a) along the
current row of the matrix and (b) along the current column of the matrix.
Finally, if a negative score would result, 0.0 is substituted to indicate that
there is no alignment similarity up to the current cell position. Once the
matrix is complete, the highest score is located; it represents the end point
of the highest-scoring alignment between the two sequences, and the other
elements leading to it are determined using a backtracking procedure.

The essential difference between the two algorithms is that in Smith-Waterman
the matrix contains a maximum value that may not be at the N-termini of the
sequences. It represents the end point of an alignment such that no other pair
of segments with greater similarity exists between the two sequences. Hence it
is a local, rather than a global, method.
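
A simplified sketch of the local method in Python is given below. It uses the
1.0 / -0.333 match and mismatch values mentioned above, but the gap handling is
an assumption: a fixed penalty of 1.0 per gap position is applied one step at a
time, instead of taking the maximum over deletions of every length along the
row and column. The function name smith_waterman is hypothetical, and the
traceback direction is stored in each cell so that no floating-point values
need to be compared for equality.

```python
def smith_waterman(seq_a, seq_b, match=1.0, mismatch=-1.0 / 3.0, gap=-1.0):
    """Return (best score, aligned part of seq_a, aligned part of seq_b)."""
    n, m = len(seq_a), len(seq_b)
    # Edge cells stay at 0.0: they end alignments of length zero
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            candidates = [
                (0.0, None),                           # no similarity up to here
                (score[i - 1][j - 1] + sub, "diag"),   # pair the two residues
                (score[i - 1][j] + gap, "up"),         # gap in seq_b
                (score[i][j - 1] + gap, "left"),       # gap in seq_a
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Backtrack from the highest-scoring cell until a zero cell is reached
    row_a, row_b = [], []
    i, j = best_pos
    while move[i][j] is not None:
        step = move[i][j]
        if step == "diag":
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif step == "up":
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Hypothetical sequences sharing only the short region ACG
print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Only the shared region is reported; the flanking residues, which would lower
the score, are left out of the alignment.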

Word or k-tuple methods, such as used by the programs FASTA and BLAST

Word methods, also known as k-tuple methods, are heuristic methods that are
not guaranteed to find an optimal alignment but are significantly faster than
dynamic programming. These methods are especially useful in large-scale
database searches, where it is understood that a large proportion of the
candidate sequences will have essentially no significant match with the query
sequence. Word methods identify a series of short, non-overlapping subsequences
("words") in the query sequence that are then matched to candidate database
sequences.

The first widely used program to perform optimized database similarity
searches for local alignments using a substitution matrix was FASTA. Since
applying this strategy exhaustively would take a substantial amount of time,
FASTA improves speed by using the observed pattern of word hits to identify
potential matches before attempting the more time-consuming optimized search.
In this method, the user defines a value k to use as the word length with which
to search the database. The most promising potential matches are then evaluated
further by a search for a gapped local alignment using the Smith-Waterman
algorithm. The method is slower but more sensitive at lower values of k, which
are also preferred for searches involving a very short query sequence.
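
The idea of using the pattern of word hits to spot potential matches can be
sketched as follows. This is an illustration of the general k-tuple idea, not
FASTA's actual implementation: the function names word_index and diagonal_hits
are hypothetical, and for simplicity the sketch indexes every overlapping word
of length k in the query. Hits that fall on the same diagonal (the same offset
between subject and query positions) point to a candidate region.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Map each word of length k to the positions where it starts in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k=2):
    """Count exact word hits per diagonal (subject start minus query start)."""
    index = word_index(query, k)
    hits = Counter()
    for s_pos in range(len(subject) - k + 1):
        word = subject[s_pos:s_pos + k]
        for q_pos in index.get(word, ()):
            hits[s_pos - q_pos] += 1
    return hits


# Hypothetical sequences: the query occurs in the subject at offset 2
print(diagonal_hits("ACGT", "TTACGT", k=2))
```

A diagonal that collects many hits is a natural place to attempt the slower
gapped alignment; a smaller k produces more hits and therefore more work.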

FASTA provides an estimate of the statistical significance of each alignment
found. The program assumes an extreme value distribution for random scores, but
uses a rewritten form of the probability density function in which the expected
score is a linear function of the natural log of the length of the database
sequence. An expectation E is calculated, which gives the expected number of
random alignments with Z-scores greater than or equal to the value observed.

BLAST

The BLAST family of search methods provides a number of algorithms optimized
for particular types of queries, such as searching for distantly related
sequence matches. BLAST was developed to provide a faster alternative to FASTA
without sacrificing much accuracy. Like FASTA, BLAST uses a word search of
length k, but it evaluates only the most significant word matches rather than
every word match, as FASTA does. The BLAST programs introduced the idea of
neighbourhood words: instead of requiring words to match exactly, a word hit is
achieved if the word taken from the subject sequence has a score of at least T
when compared, using a substitution matrix, with the word from the query. This
strategy allows the word size (W) to be kept high (for speed) without
sacrificing sensitivity. Thus T becomes the critical parameter determining
speed and sensitivity, and W is rarely varied. If the value of T is increased,
the number of background word hits goes down and the program runs faster.
Reducing T allows more distant relationships to be found. The occurrence of a
word hit is followed by an attempt to find a locally optimal alignment whose
score is at least equal to a cutoff score. This is accomplished by iteratively
extending the alignment both to the left and to the right, accumulating
incremental scores for matches, mismatches and the introduction of gaps.
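
The effect of the threshold T on the neighbourhood of a word can be illustrated
with a toy example. This is not BLAST's code: the function name neighbourhood
is hypothetical, and a made-up scoring rule (+2 for identical letters, -1
otherwise, over a four-letter alphabet) stands in for a real substitution
matrix such as PAM or BLOSUM.

```python
from itertools import product


def neighbourhood(word, threshold, alphabet="ACGT", match=2, mismatch=-1):
    """List every word of the same length scoring at least threshold."""
    def score(a, b):
        # Toy stand-in for a substitution matrix lookup
        return sum(match if x == y else mismatch for x, y in zip(a, b))

    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return sorted(w for w in candidates if score(word, w) >= threshold)


# With W = 3: T = 6 keeps only the exact word, T = 3 also admits
# every word that differs from it at one position
print(neighbourhood("ACG", 6))
print(len(neighbourhood("ACG", 3)))
```

Raising the threshold shrinks the list (fewer word hits, faster search), while
lowering it admits more distant words, in line with the trade-off described in
the text.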

Interpretation of results

Sequence alignment is useful for discovering functional, structural and
evolutionary information in biological sequences. Sequences that are very much
alike in a sequence analysis probably have the same function: for example,
similar DNA sequences may have a similar regulatory role, and similar protein
sequences may have a similar biochemical function and a similar
three-dimensional structure. If two sequences from different organisms are
similar, they may have originated from a common ancestral sequence, and the
sequences are then defined as being homologous.

Evolution of a new gene is often thought to occur by gene duplication, which
creates two tandem copies of the gene, followed by mutations in these copies.
In rare cases, new mutations in one of the copies provide an advantageous
change in function. In due course the two copies may evolve along separate
pathways. Although the resulting separation of function will generate two
related sequence families, sequences in both families will still be similar
because of their single ancestral gene. Sequence similarity searches also help
in identifying several possible types of relationship between sequences, such
as orthologs, paralogs and analogs. Interpretation of alignments can also help
to sort out the possible evolutionary origins of similar sequences.

Alignment can reveal homology between sequences. In all methods of sequence
comparison, the fundamental question is whether the similarities perceived
between two sequences are due to chance, and are thus of little biological
significance, or whether they are due to the derivation of the sequences from a
common ancestral sequence, in which case the sequences are homologous.

If two sequences in an alignment share a common ancestor, mismatches in the
alignment can be interpreted as point mutations and gaps as indels (insertions
or deletions) introduced in one or both lineages in the time since they
diverged from one another.

In sequence alignments of proteins, the degree of similarity between the amino
acids occupying a particular position in the sequence can be interpreted as a
rough measure of how conserved a particular region or sequence motif is among
lineages. The presence of conserved regions, or of only very conservative
substitutions, suggests that the region has structural importance. Similarly,
the presence of conserved base pairs indicates a similar functional or
structural role.

Any standard alignment program will produce some statistical value indicating
the probability that an alignment of a given quality could arise by chance, but
this value does not indicate how much better a given alignment is than
alternative alignments of the same sequences. The statistics quoted in BLAST for
pairwise comparisons are probability (p) or expected frequency (E) values. The
p-value relates the score returned for an alignment to the likelihood of its
having arisen by chance: the closer the value is to zero, the greater the
confidence that the match is real; conversely, the nearer the value is to
unity, the greater the chance that the match is spurious.

The E-value describes the number of hits one can expect to see by chance (in
other words, noise) when searching a database of a particular size. For example,
an E-value of 1 assigned to a hit can be interpreted as meaning that, in the
current search, one might expect to see one match with a similar score simply
by chance. Conversely, a value of zero indicates that no matches would be
expected by chance.
