Alignments Lecture
Lecture notes
Alignments in bioinformatics
Compiled by:
Pharmaceutical Bioinformatics, 7.5p
Lecture notes
ALIGNMENTS IN BIOINFORMATICS
Sequence Analysis
Biological Background for Sequence Analysis
Searching of databases for sequences similar to a new sequence
Sequence alignment
Structural alignments
Data produced by structural alignment
References
Sequence Analysis
The fundamental building blocks of life are proteins. Enzymes, which are the
molecular machines responsible for virtually all of the chemical transformations that
cells are capable of, are proteins. In addition, much of the structure of a cell is made
up of proteins. That part of the structure which is not made up of proteins is produced
by enzymes which are proteins. A human contains on the order of 100,000 different
proteins. It is the properties of and the interactions between these 100,000 proteins
that make us what we are.
Proteins are variable-length, linear, mixed polymers of 20 different amino acids. Other
terms used more or less interchangeably for amino acid polymers are peptides and
polypeptides. These topologically linear polymers fold upon themselves to generate a
shape characteristic of each different protein, and this shape, along with the different
chemical properties of the 20 amino acids, determines the function of the protein. One
of the most important concepts in modern biology is that the functional properties of
proteins are determined largely by the sequence of the 20 amino acids in the linear
polypeptide chain, and that in many cases proteins are largely self-folding. Thus, in
theory, knowing the sequence of a protein (the order in which the amino acids
occur), one could infer its function.
What determines the order of amino acids in a protein? The Central Dogma of
Molecular Biology describes how the genetic information we inherit from our parents
is stored in DNA, how that information is used to make identical copies of that DNA,
and how it is transferred from DNA to RNA to protein. DNA is a linear polymer of 4
nucleotides: deoxyAdenosine monophosphate (abbreviated A), deoxyThymidine
monophosphate (abbreviated T), deoxyGuanosine monophosphate (abbreviated G)
and deoxyCytidine monophosphate (abbreviated C). RNA is a very similar polymer
of Adenosine monophosphate, Guanosine monophosphate, Cytidine monophosphate,
and Uridine monophosphate. Uridine monophosphate, abbreviated U, is a nucleotide
functionally equivalent to Thymidine monophosphate.
A property of both DNA and RNA is that the linear polymers can pair one with
another, such pairing being sequence specific. In such double polymers (referred to as
a "double helix" due to the shape they assume) G pairs with C and A pairs with T or
U. All possible combinations of DNA and RNA double helices occur. One strand of
DNA can serve as a template for the construction of a complementary strand, and this
complementary strand can be used to recreate the original strand. This is the basis of
DNA replication and thus all of genetics. Similar templating results in an RNA copy
of a DNA sequence. Conversion of that RNA sequence into a protein sequence is
more complex. This occurs by translation of a code consisting of three nucleotides
into one amino acid, a process accomplished by cellular machinery including tRNA
and ribosomes.
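The pairing rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original notes: the function names are hypothetical, and transcribe() assumes that its input is the coding (non-template) strand, so that the RNA copy has the same sequence with U in place of T.

```python
# Watson-Crick pairing for DNA: G pairs with C, A pairs with T.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand):
    """Return the RNA copy of a coding strand (assumption: T is simply replaced by U)."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    strand = "ATGGCCTAA"
    partner = reverse_complement(strand)
    print(partner)
    # The complement of the complement recreates the original strand.
    print(reverse_complement(partner) == strand)
    print(transcribe(strand))
```

Taking the reverse complement twice returns the original strand, which is the templating property that replication relies on.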
Four different nucleotides taken three at a time can result in 64 different possible
triplet codes (codons), more than enough to encode 20 amino acids. These 64
codes are mapped onto 20 amino acids as follows: first, one amino acid may be encoded
by 1 to 6 different triplet codes, and second, 3 of the 64 codes, called stop codons,
specify "end of peptide sequence". Where multiple codons specify the same amino
acid, the different codons are used with unequal frequency and this distribution of
frequency is referred to as "codon usage". Codon usage varies between species.
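The counts quoted above can be checked with a short Python sketch. It assumes the standard genetic code, written as a 64-letter string in the conventional T, C, A, G codon order with '*' marking stop codons; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid (stop codons counted separately).
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
stop_codons = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")

if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")
    print(len(codons_per_amino_acid), "amino acids")
    print("codons per amino acid range from",
          min(codons_per_amino_acid.values()), "to",
          max(codons_per_amino_acid.values()))
    print("stop codons:", stop_codons)
```

Codon usage is not modelled here; it would require counting how often each synonymous codon occurs in the genes of a given species.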
The fact that DNA nucleotides need to be read three at a time to specify a protein
sequence implies that a DNA sequence has three different reading frames determined
by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the
same frame as nucleotide one and so on). Both strands of DNA can be copied into
RNA (for translation into protein). Thus, a DNA sequence with its (inferred)
complementary strand can specify six different reading frames.
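The six reading frames can be enumerated as in the sketch below. It assumes the standard genetic code and labels the frames +1 to +3 (given strand) and -1 to -3 (complementary strand); the labels and function names are conventions chosen for this illustration, not taken from the notes.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Inferred complementary strand, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping frame label to conceptual translation ('*' = stop)."""
    dna = dna.upper()
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(opposite[offset:])
    return frames


if __name__ == "__main__":
    for label, peptide in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(label, peptide)
```

For the toy input ATGGCCTAA, frame +1 reads ATG GCC TAA and translates to MA*, while frame -1 reads the complementary strand TTA GGC CAT and translates to LGH.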
If you have just determined a sequence of an interesting bit of DNA, one of the first
questions you are likely to ask yourself is "has anybody else seen anything like this?"
Fortunately, there has been a very successful international effort to collect all the
sequences people have determined in one place so they can be searched. For DNA
sequences, three groups have cooperated in this effort, one in Japan, one in Europe,
and one in the United States to produce DDBJ, EMBL and GenBank, respectively.
These databases are frequently reconciled with each other, so that searching any one
is virtually the same as searching all three. The problem is that these databases are
HUGE and, as a result, you must compare your sequence with this vast number of
other sequences efficiently. A number of programs have been written to rapidly
search a database for a query sequence, two of which, BLAST and FASTA, will be
discussed in this course. The techniques used by these programs to make searching
rapid result in some loss of rigor of comparison. It is possible (although, as it turns
out, unlikely) that a weak but relevant similarity could be missed by these programs.
In addition, many times these programs will flag a sequence as being similar to your
query sequence when this similarity is not significant. Thus, these programs should be
seen as tools for identifying a small subset of sequences from the database for
retrieval and further analysis rather than ends in themselves.
Databases of protein sequences, including UniProt and PIR, also exist and can
similarly be searched.
Which program should you use to search a database, FASTA or BLAST? This
question is about as controversial as that over choices of computers (Mac vs. PC) or
religions. In fact, as you enter the world of sequence analysis, you will find religious
wars between proponents of different programs over and over. Worse, new programs
are constantly appearing. In addition, even after having selected a program, you will
frequently have to select values for "parameters" and always have to interpret the
output. There are no magic answers to help you do these things.
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the primary sequences
of DNA, RNA, or protein to identify regions of similarity that may be a consequence
of functional, structural, or evolutionary relationships between the sequences. Aligned
sequences of nucleotide or amino acid residues are typically represented as rows
within a matrix. Gaps are inserted between the residues so that residues with identical
or similar characters are aligned in successive columns.
Fig: A sequence alignment, produced by ClustalW, between two human zinc finger proteins
identified by GenBank accession number.
Very short or very similar sequences can be aligned by hand; however, most
interesting problems require the alignment of lengthy, highly variable or extremely
numerous sequences that cannot be aligned solely by human effort. Instead, human
knowledge is primarily applied in constructing algorithms to produce high-quality
sequence alignments, and occasionally in adjusting the final results to reflect patterns
Multiple sequence alignment also refers to the process of aligning such a sequence
set. Because three or more sequences of biologically relevant length are nearly
impossible to align by hand, computational algorithms are used to produce and
analyze the alignments. MSAs require more sophisticated methodologies than
pairwise alignment because they are more computationally complex to produce. Most
multiple sequence alignment programs use heuristic methods rather than global
optimization because identifying the optimal alignment between more than a few
sequences of moderate length is prohibitively computationally expensive.
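A rough calculation shows why exact methods become impractical. Assuming the straightforward extension of pairwise dynamic programming to N sequences of equal length L (a simplification made for this sketch), the method has to fill a lattice of (L + 1)^N cells, and each cell looks back at up to 2^N - 1 neighbouring cells. The function names below are illustrative.

```python
def lattice_cells(num_sequences, length):
    """Cells in the dynamic-programming lattice for sequences of equal length."""
    return (length + 1) ** num_sequences


def neighbours_per_cell(num_sequences):
    """Each cell can be reached from up to 2^N - 1 predecessor cells."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, "sequences of length 300:",
              lattice_cells(n, 300), "cells,",
              neighbours_per_cell(n), "predecessors per cell")
```

The cell count grows exponentially with the number of sequences, which is why heuristic strategies are preferred for more than a handful of sequences.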
Sequences can be aligned across their entire length (global alignment) or only in
certain regions (local alignment). This is true for pairwise and multiple alignments.
Global alignments need to use gaps (representing insertions/deletions) while local
alignments can avoid them, aligning regions between gaps.
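The difference between the two modes can be seen in how the dynamic-programming table is filled. The sketch below computes only the best score, in the style of the classical Needleman-Wunsch (global) and Smith-Waterman (local) algorithms. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not values prescribed by these notes.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: global score, every residue of both sequences is aligned
                 (to a residue or to a gap).
    local=True:  local score, the best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(align_score("ACGT", "AGT"))                         # global: one gap is needed
    print(align_score("TTTACGTTTT", "GGACGGG", local=True))   # local: only ACG is aligned
```

With these scores the global alignment of ACGT and AGT scores 1 (three matches and one gap), whereas the local alignment of the second pair scores 3 because only the shared ACG region is aligned and the flanking residues are ignored.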
Some programs give quantitative measures for the significance of the alignment.
These are usually based on the chance occurrence of such alignments and depend on
the size and composition of the aligned sequences. Empirical measures are also
extremely useful for deciding the 'correctness' of the multiple alignment. Consistency
is a powerful measure for correct multiple alignments. If the same alignment is found
in the sequence-to-sequence searches and various multiple alignment methods it is
most probably correct. One pitfall to avoid is biased sequence composition that may
lead to trivial alignments.
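One simple way to make "chance occurrence" concrete is a shuffling test: the subject sequence is shuffled many times (which preserves its size and composition) and the observed score is compared with the scores obtained against the shuffled copies. The sketch below is an illustration of the idea only; it is not the statistic reported by any particular program, and the scoring values, trial count and function names are assumptions.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style), keeping one table row at a time."""
    previous = [0] * (len(b) + 1)
    best = 0
    for residue_a in a:
        current = [0]
        for j, residue_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if residue_a == residue_b else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def shuffle_p_value(query, subject, trials=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    observed = local_score(query, subject)
    letters = list(subject)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_score(query, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction: the estimate can never be exactly zero.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    gene = "ACGTTGCAAGCTTAGGCTAACGTTAGCCAT"
    print(shuffle_p_value(gene, gene))                  # small: unlikely by chance
    print(shuffle_p_value("A" * 20, "A" * 20))          # 1.0: biased composition
```

The second call illustrates the pitfall of biased composition: two poly-A sequences align perfectly, but so does every shuffled copy, so the match carries no information.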
expect the sites to be aligned together and may 'force' that alignment. Such manual
alignments can serve as a seed to an alignment with more sequences.
Local multiple alignments (blocks) from different programs can be joined or used
together. Another approach is 'divide and conquer'. Blocks present in all sequences
divide them into separate parts, in each of which more blocks can be searched for.
BLAST
BLAST is an acronym for Basic Local Alignment Search Tool, and it consists of a set
of algorithms for comparing biological sequences such as nucleotide or protein
sequences. A nucleotide sequence is a DNA sequence (or part of one)
expressed as a long string of 4 characters: A, T, C and G. They stand for Adenine,
Thymine, Cytosine and Guanine. So, every nucleotide sequence consists of only these
four characters arranged in different orders.
BLAST allows you to compare your sequence against a database of sequences and
informs you if your sequence matches any of the sequences in the database, along
with a lot of information like:
So, now that you know BLAST can be used to align two sequences and to study the
similarity between two or more sequences, let us look into the principles of sequence
alignment briefly.
Sequence alignment refers to arranging two sequences in an order such that their
similar portions are highlighted.
For example:
AGCTATGGGCAAATTTGGAACAAACCAAAAAGT
........ ........ ...............
AGCTATGGACAAATTTGCAACAAACCAAAAAGT
Positions where the two sequences do not match are shown by blanks in the middle line of the alignment.
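The middle line of the example above can be generated mechanically. The following sketch assumes the two sequences are already aligned and of equal length; the function name and the choice of '.' for a match and a blank for a mismatch follow the example and are otherwise arbitrary.

```python
def match_line(top, bottom, same=".", different=" "):
    """Build the middle line of a pairwise alignment display."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return "".join(same if x == y else different for x, y in zip(top, bottom))


if __name__ == "__main__":
    first = "AGCTATGGGCAAATTTGGAACAAACCAAAAAGT"
    second = "AGCTATGGACAAATTTGCAACAAACCAAAAAGT"
    middle = match_line(first, second)
    print(first)
    print(middle)
    print(second)
    print(middle.count("."), "of", len(first), "positions identical")
```

For the two sequences shown, 31 of the 33 positions are identical; the two blanks mark positions 9 and 18.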
Global Alignment: It refers to the alignment in which all the characters in both
sequences participate in the alignment.
BLAST flavours
The BLAST programs are widely used tools for searching DNA and protein databases
for sequence similarity to identify homologs to a query sequence. While often referred
to as just "BLAST", this can really be thought of as a set of programs: blastp, blastn,
blastx, tblastn, and tblastx.
blastp
o Compares an amino acid query sequence against a protein sequence
database
blastn
o Compares a nucleotide query sequence against a nucleotide sequence
database
blastx
o Compares the six-frame conceptual translation products of a nucleotide
query sequence (both strands) against a protein sequence database
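tblastn
o Compares a protein query sequence against a nucleotide sequence
database dynamically translated in all six reading frames (both strands)
tblastx
o Compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database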
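The choice between these programs depends only on the type of the query and the type of the database. The lookup below is a hypothetical helper written for these notes, not part of the BLAST software; it simply records which of the five programs accepts which combination.

```python
# (query type, database type) -> BLAST programs that accept that combination.
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def suggest_blast_programs(query_type, database_type):
    """Return the candidate programs for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


if __name__ == "__main__":
    print(suggest_blast_programs("nucleotide", "protein"))
    print(suggest_blast_programs("nucleotide", "nucleotide"))
```

For a nucleotide query against a nucleotide database there are two candidates: blastn compares the nucleotides directly, while tblastx compares the translated sequences on both sides.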
Clustal
Clustal is a fully automatic program for global multiple alignment of DNA and
protein sequences. The alignment is progressive and considers the sequence
redundancy. Trees can also be calculated from multiple alignments (see below). The
program has some adjustable parameters with reasonable defaults. ClustalW is
available on the WWW and for various computer operating systems.
3. Combine the alignments from 1 in the order specified in 2 using the rule "once a
gap, always a gap".
In stage 1:
2. Using the matrix from 1.2.1 and Neighbor-Joining, ClustalW constructs the
similarity tree. The root is placed in the middle of the longest chain of consecutive
edges.
3. Combine the alignments, starting from the closest related groups (going from the
tips of the tree towards the root).
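The idea of a pairwise distance matrix that determines where progressive alignment starts can be sketched as follows. This is a simplified illustration and not ClustalW's actual procedure: it assumes the pairwise alignments are already given as equal-length strings, uses one minus the fraction of identical positions as the distance, and only reports the closest pair instead of building a Neighbor-Joining tree. All names are hypothetical.

```python
from itertools import combinations


def identity_distance(a, b):
    """1 - fraction of identical positions, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    identical = sum(1 for x, y in pairs if x == y)
    return 1.0 - identical / len(pairs)


def distance_matrix(sequences):
    """Map each unordered pair of sequence names to its distance."""
    return {
        (p, q): identity_distance(sequences[p], sequences[q])
        for p, q in combinations(sequences, 2)
    }


def closest_pair(distances):
    """The pair with the smallest distance, i.e. where merging would start."""
    return min(distances, key=distances.get)


if __name__ == "__main__":
    toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCA"}
    matrix = distance_matrix(toy)
    for pair, distance in sorted(matrix.items()):
        print(pair, distance)
    print("align first:", closest_pair(matrix))
```

In the toy data s1 and s2 differ at one of eight positions, so they would be aligned first and s3 would be added to that pair afterwards.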
Viewing
Multiple alignments of many sequences and those with different sequence weights are
difficult to visualize. Sequence logos are a graphical way of presenting multiple
alignments.
ID ADH_IRON_1; BLOCK
AC BL00913C; distance from previous block=(56,76)
DE Iron-containing alcohol dehydrogenases proteins.
BL HHG motif; width=22; seqs=11; 99.5%=492; strength=1428
ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL 66
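In a sequence logo, the height of each column reflects its information content and each residue takes a share of that height in proportion to its frequency. The sketch below computes these heights for an ungapped protein block. It assumes unweighted sequences, an alphabet of 20 amino acids, and no small-sample correction; the function names and the toy block are made up for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of one alignment column, in bits (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment, alphabet_size=20):
    """For each column, map residue -> letter height (bits) in the logo."""
    heights = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        information = column_information(column, alphabet_size)
        heights.append({
            residue: information * count / len(residues)
            for residue, count in Counter(residues).items()
        })
    return heights


if __name__ == "__main__":
    toy_block = ["CHSM", "CHAL", "CHSI", "CHTV"]  # invented four-sequence block
    for position, column in enumerate(logo_heights(toy_block), start=1):
        print(position, column)
```

A fully conserved column reaches the maximum of log2(20) bits, about 4.32, whereas a column in which every sequence has a different residue is much lower.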
Fig: A tree made from the three blocks in the iron containing alcohol dehydrogenases
family. Bootstrap values are for 100 trials. The tree was calculated from the blocks
with the ClustalW program and drawn with the TreeView program.
Searching
Multiple alignments are powerful tools for identifying new members of the aligned
group. It is possible to query databases of multiple alignments with single sequences
and to query sequence databases with multiple alignments. It has been shown that
such searches are more sensitive and selective than sequence-to-sequence searches. A
simple (but very effective!) 'hybrid' approach is to use a properly made consensus
sequence
code and bias the primers toward some of the sequences. CODEHOP primers were
shown to be more effective than simple degenerate primers in various cases.
Structural alignments
Structural alignment is a form of sequence alignment based on comparison of shape.
These alignments attempt to establish equivalences between two or more polymer
structures based on their shape and three-dimensional conformation. This process is
usually applied to protein tertiary structures but can also be used for large RNA
molecules. In contrast to simple structural superposition, where at least some
equivalent residues of the two structures are known, structural alignment requires no a
priori knowledge of equivalent positions. Structural alignment is a valuable tool for
the comparison of proteins with low sequence similarity, where evolutionary
relationships between proteins cannot be easily detected by standard sequence
alignment techniques. Structural alignment can therefore be used to imply
evolutionary relationships between proteins that share very little common sequence.
However, caution should be used in using the results as evidence for shared
evolutionary ancestry because of the possible confounding effects of convergent
evolution by which multiple unrelated amino acid sequences converge on a common
tertiary structure.
The outputs of a structural alignment are a superposition of the atomic coordinate sets
and a minimal root mean square distance (RMSD) between the structures. The RMSD
of two aligned structures indicates their divergence from one another. Structural
alignment can be complicated by the existence of multiple protein domains within one
or more of the input structures, because changes in relative orientation of the domains
between two structures to be aligned can artificially inflate the RMSD.
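The effect of domain movement on the RMSD can be illustrated with a small calculation. The sketch below assumes that the two structures have already been superposed and that equivalent atoms are listed in the same order; it does not perform the optimal superposition itself. The coordinates are invented toy values in Angstrom.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between equivalent atoms of two coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal size")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy structure with two "domains" of two atoms each.
    domain1 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    domain2 = [(10.0, 0.0, 0.0), (11.5, 0.0, 0.0)]
    # Second structure: identical domains, but domain 2 is shifted by 3 Angstrom.
    domain2_moved = [(x + 3.0, y, z) for x, y, z in domain2]

    print(rmsd(domain1 + domain2, domain1 + domain2_moved))  # whole structure
    print(rmsd(domain1, domain1))                            # domain 1 alone
```

Each domain is unchanged internally, yet the whole-structure value is about 2.12 Angstrom, because the shifted domain contributes its full displacement. Comparing the domains separately avoids this inflation.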
Fig: Structural alignment of thioredoxins from humans and the fly Drosophila
melanogaster. The proteins are shown as ribbons, with the human protein in red, and
the fly protein in yellow. Generated from PDB 3TRX and 1XWC.
References
http://en.wikipedia.org/wiki/Structural_alignment
http://en.wikipedia.org/wiki/Sequence_alignment_software
http://en.wikipedia.org/wiki/Multiple_sequence_alignment
http://puneetwadhwa.blogspot.com/2005/10/introduction-to-blast-basic-local.html
http://en.wikipedia.org/wiki/BLAST