Multiple Sequence Alignment Part 1
• MSA reveals much more biological information than many pairwise alignments can.
• It allows the identification of conserved sequence patterns and motifs across the
whole sequence family, which are hard to detect when comparing only two sequences.
• Many conserved and functionally critical amino acid residues can be identified in a protein
multiple alignment.
• MSA is an essential prerequisite to carrying out phylogenetic analysis of sequence families
and prediction of protein secondary and tertiary structures.
• Multiple sequence alignment also has applications in designing polymerase chain reaction
(PCR) primers based on multiple related sequences.
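As a rough illustration of how conserved positions stand out in a multiple alignment, the sketch below scans a toy alignment column by column. The sequences and the function name `column_conservation` are made up for this example; real analyses would use a scoring scheme rather than strict identity.

```python
from collections import Counter

def column_conservation(alignment):
    """Return (most common symbol, its frequency) for every alignment column."""
    length = len(alignment[0])
    assert all(len(seq) == length for seq in alignment), "rows must have equal length"
    result = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(alignment)))
    return result

# Toy alignment (invented sequences, '-' marks a gap).
toy_alignment = [
    "MKV-LHGD",
    "MKVALHGE",
    "MRV-LHGD",
    "MKIALHGD",
]

# Columns where every sequence carries the same residue.
conserved = [
    i for i, (symbol, freq) in enumerate(column_conservation(toy_alignment))
    if freq == 1.0 and symbol != "-"
]
print(conserved)
```

Columns that are identical in all four sequences are candidates for functionally important residues; a pairwise comparison of any two rows alone would flag more columns as identical and could not separate the two cases.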
Phylogenetic analysis
• Phylogenetic analysis is the study of the evolutionary development of a species or a
particular characteristic of an organism.
• Branching diagrams are made to represent the relationship between different species,
organisms, or characteristics of an organism (genes, proteins, organs, etc.) that are
developed from a common ancestor.
• The diagram is known as a phylogenetic tree.
Protein structure
• Secondary structure is defined as the local spatial conformation of the polypeptide
backbone, excluding the side chains.
Secondary structure
• Secondary structure elements common to many proteins include α‐helices and β‐sheets.
Tertiary structure
• The resulting short alignments are joined together head to tail to yield a
multiple alignment of the entire length of all sequences.
• It performs global alignment and requires the input sequences to be of
similar lengths and domain structures.
• Despite the use of heuristics, the program is still extremely computationally
intensive and can handle only a very limited number of sequences.
Heuristic algorithms
• The heuristic algorithms for MSA fall into three categories:
1. Progressive alignment,
2. Iterative alignment, and
3. Block-based alignment.
Progressive alignment
• It conducts pairwise alignments using the Needleman–Wunsch method and records similarity
scores (based on a particular substitution matrix) from the pairwise comparisons.
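The pairwise scoring step can be sketched as follows. This is a minimal Needleman–Wunsch score computation, assuming a simple match/mismatch scheme and a linear gap penalty in place of a real substitution matrix; the sequences, parameter values, and function name are illustrative only.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

# Record one similarity score per pair of (invented) sequences.
seqs = {"s1": "GATTACA", "s2": "GATGACA", "s3": "GCATGCU"}
names = sorted(seqs)
pairwise = {
    (x, y): needleman_wunsch_score(seqs[x], seqs[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}
print(pairwise)
```

The resulting table of pairwise scores is what a progressive method would later convert into distances for the guide tree.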
Progressive alignment:
First step
• A guide tree is generated from the distance matrix.
• The tree reflects evolutionary proximity among all the sequences.
Construction of the guide tree:
UPGMA method
• Unweighted Pair Group Method using Arithmetic average (UPGMA) is the
simplest clustering method.
• The basic assumption of the UPGMA method is that all taxa evolve at a
constant rate and that they are equally distant from the root.
• However, real data rarely meet this assumption, so UPGMA often produces erroneous tree
topologies.
• Its main advantage is speed.
Construction of the guide tree:
UPGMA method
     A     B     C     D
A    -
B    0.4   -
C    0.35  0.45  -
D    0.6   0.7   0.55  -
1. Join the two closest nodes. These are A and C, with distance 0.35.
2. This new node is U.
3. The branch length from U to A or from U to C is 0.35/2 = 0.175.
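The step that follows the creation of U is not spelled out in the slides. Assuming the usual UPGMA rule (the distance from a cluster to another node is the arithmetic mean of the distances of its members), the reduced distances would work out as:

```latex
\begin{aligned}
d_{UB} &= \tfrac{1}{2}\,(d_{AB} + d_{CB}) = \tfrac{1}{2}\,(0.4 + 0.45) = 0.425 \\
d_{UD} &= \tfrac{1}{2}\,(d_{AD} + d_{CD}) = \tfrac{1}{2}\,(0.6 + 0.55) = 0.575 \\
d_{BD} &= 0.7
\end{aligned}
```

Under that assumption the smallest entry is $d_{UB} = 0.425$, so U and B would be joined next into a cluster containing A, B, and C, with the new node at height $0.425 / 2 = 0.2125$.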
Construction of the guide tree:
UPGMA method
Reduced distance matrix (V is the cluster containing A, B, and C):
     V      D
V    -
D    0.617  -
1. D_DV = (D_DA + D_DB + D_DC) / 3 = (0.6 + 0.7 + 0.55) / 3 = 0.617.
2. Join D and V.
3. This new node is W.
4. The branch length from D to W is 0.617/2 = 0.309.
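The whole UPGMA procedure on the example matrix can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `upgma`, the `frozenset` keys, and the nested-label naming of clusters are choices made for this example.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster with UPGMA; return a list of (left, right, node_height) merges.

    `dist` maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = {name: (name,) for name in names}  # cluster label -> member leaves

    def avg(x, y):
        # Arithmetic mean of all leaf-to-leaf distances between two clusters.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: avg(*p))
        height = avg(x, y) / 2  # both children are equally far from the new node
        merges.append((x, y, height))
        members = clusters.pop(x) + clusters.pop(y)
        clusters[f"({x},{y})"] = members
    return merges

example = {
    frozenset("AB"): 0.4, frozenset("AC"): 0.35, frozenset("BC"): 0.45,
    frozenset("AD"): 0.6, frozenset("BD"): 0.7, frozenset("CD"): 0.55,
}
merges = upgma(["A", "B", "C", "D"], example)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

On the example matrix this joins A with C at height 0.175, then adds B, then D. The unrounded height of the last node is about 0.308; the 0.309 in the worked example comes from halving the already rounded 0.617.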
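Construction of the guide tree:
Neighbor Joining method
• For each taxon $i$ among the $n$ taxa, the r-value is the sum of its distances to all
other taxa, and the transformed r-value divides that sum by $n - 2$:
$$r_i = \sum_{j \neq i} d_{ij}$$
$$r'_i = \frac{r_i}{n - 2}$$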
• Once the “r-values” and “transformed r-values” are in hand, and $d_{ij}$ is the
evolutionary distance between $i$ and $j$, the converted distance between $i$ and $j$ is
given as:
$$d'_{ij} = d_{ij} - \frac{r_i + r_j}{n - 2} = d_{ij} - r'_i - r'_j$$
• Before tree construction, all possible nodes are collapsed into a star tree.
• The pair of taxa with the shortest corrected distance in the new matrix is separated from
the star tree first.
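The transformed r-values and corrected distances can be computed directly from a distance matrix. The sketch below applies the formula above to the four-taxon matrix used throughout these slides; the function name `corrected_distances` and the `frozenset` keys are illustrative choices.

```python
def corrected_distances(names, d):
    """Return (transformed r-values, corrected distances) for neighbor joining.

    `d` maps frozenset({i, j}) to the distance between taxa i and j.
    """
    n = len(names)
    # r_i is the sum of distances from i to every other taxon.
    r = {i: sum(d[frozenset((i, j))] for j in names if j != i) for i in names}
    r_prime = {i: r[i] / (n - 2) for i in names}
    corrected = {
        (i, j): d[frozenset((i, j))] - r_prime[i] - r_prime[j]
        for k, i in enumerate(names)
        for j in names[k + 1:]
    }
    return r_prime, corrected

example = {
    frozenset("AB"): 0.4, frozenset("AC"): 0.35, frozenset("BC"): 0.45,
    frozenset("AD"): 0.6, frozenset("BD"): 0.7, frozenset("CD"): 0.55,
}
r_prime, corrected = corrected_distances(["A", "B", "C", "D"], example)
for pair, value in corrected.items():
    print(pair, round(value, 3))
```

For this matrix the pairs (A, B) and (C, D) share the smallest corrected distance, so either pair could be separated from the star tree first.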
Construction of the guide tree:
Neighbor Joining method
while n > 2 {
    Calculate r_i and r'_i.
    Calculate d'_ij.
    Create the corrected distance matrix.
    Join the two nodes which are the closest, S and T. Let the new joined node be X.
    Calculate distances from X to S and T.
    Calculate distances from X to the remaining nodes and create a reduced matrix using the previous matrix.
}
if n == 2
    Join the two nodes and calculate the distance between them.
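The loop above can be turned into a short runnable sketch. The slides do not give the formulas for the distances from X, so this version assumes the commonly used neighbor-joining forms: $d_{SX} = d_{ST}/2 + (r'_S - r'_T)/2$, $d_{TX} = d_{ST} - d_{SX}$, and $d_{XK} = (d_{SK} + d_{TK} - d_{ST})/2$ for each remaining node $K$. The function name, the `X1`, `X2` node labels, and the edge-list output are illustrative.

```python
from itertools import combinations

def neighbor_joining(names, d):
    """Return the tree as a list of (node, neighbor, branch_length) edges.

    `d` maps frozenset({i, j}) to the distance between taxa i and j.
    """
    d = dict(d)              # work on a copy; new nodes are added as we go
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        r_prime = {
            i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
            for i in nodes
        }
        # Closest pair according to the corrected distance d'_ij.
        s, t = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - r_prime[p[0]] - r_prime[p[1]],
        )
        d_st = d[frozenset((s, t))]
        counter += 1
        x = f"X{counter}"
        # Assumed standard branch-length formulas (not stated in the slides).
        d_sx = d_st / 2 + (r_prime[s] - r_prime[t]) / 2
        edges += [(s, x, d_sx), (t, x, d_st - d_sx)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (s, t):
                d[frozenset((k, x))] = (
                    d[frozenset((s, k))] + d[frozenset((t, k))] - d_st
                ) / 2
        nodes = [k for k in nodes if k not in (s, t)] + [x]
    s, t = nodes
    edges.append((s, t, d[frozenset((s, t))]))  # join the last two nodes
    return edges

example = {
    frozenset("AB"): 0.4, frozenset("AC"): 0.35, frozenset("BC"): 0.45,
    frozenset("AD"): 0.6, frozenset("BD"): 0.7, frozenset("CD"): 0.55,
}
edges = neighbor_joining(["A", "B", "C", "D"], example)
for edge in edges:
    print(edge[0], edge[1], round(edge[2], 3))
```

With these formulas the example matrix yields leaf branches of 0.15 (A), 0.25 (B), 0.15 (C), and 0.4 (D), joined by an internal branch of 0.05. Ties between corrected distances are broken by iteration order and floating-point rounding, so the order of joins may vary while the unrooted tree stays the same.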
Construction of the guide tree:
NJ method
     A     B     C     D
A    -
B    0.4   -
C    0.35  0.45  -
D    0.6   0.7   0.55  -
• Calculate $d'_{ij} = d_{ij} - \frac{r_i + r_j}{n - 2} = d_{ij} - r'_i - r'_j$.
• $d'_{AB} = d_{AB} - r'_A - r'_B = 0.4 - 0.675 - 0.775 = -1.05$
• $d'_{AC} = d_{AC} - r'_A - r'_C = 0.35 - 0.675 - 0.675 = -1$
• $d'_{AD} = d_{AD} - r'_A - r'_D = 0.6 - 0.675 - 0.925 = -1$
• $d'_{BC} = d_{BC} - r'_B - r'_C = 0.45 - 0.775 - 0.675 = -1$
• $d'_{BD} = d_{BD} - r'_B - r'_D = 0.7 - 0.775 - 0.925 = -1$
• $d'_{CD} = d_{CD} - r'_C - r'_D = 0.55 - 0.675 - 0.925 = -1.05$
Construction of the guide tree:
NJ method
Reduced distance matrix after A and B are joined into node U:
     U     C     D
U    -
C    0.2   -
D    0.45  0.55  -
• Calculate $d'_{ij} = d_{ij} - \frac{r_i + r_j}{n - 2} = d_{ij} - r'_i - r'_j$, now with $n = 3$.
• $d'_{CU} = d_{CU} - r'_C - r'_U = 0.2 - 0.75 - 0.65 = -1.2$
• $d'_{DU} = d_{DU} - r'_D - r'_U = 0.45 - 1 - 0.65 = -1.2$
• $d'_{CD} = d_{CD} - r'_C - r'_D = 0.55 - 0.75 - 1 = -1.2$
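The entries of the reduced matrix and the transformed r-values used with it are not derived in the slides. Assuming the usual neighbor-joining update $d_{UK} = (d_{AK} + d_{BK} - d_{AB})/2$ for a node U that joins A and B, they can be reproduced as:

```latex
\begin{aligned}
d_{UC} &= \tfrac{1}{2}\,(d_{AC} + d_{BC} - d_{AB}) = \tfrac{1}{2}\,(0.35 + 0.45 - 0.4) = 0.2 \\
d_{UD} &= \tfrac{1}{2}\,(d_{AD} + d_{BD} - d_{AB}) = \tfrac{1}{2}\,(0.6 + 0.7 - 0.4) = 0.45 \\
r'_C &= \frac{d_{UC} + d_{CD}}{3 - 2} = 0.2 + 0.55 = 0.75 \\
r'_D &= \frac{d_{UD} + d_{CD}}{3 - 2} = 0.45 + 0.55 = 1 \\
r'_U &= \frac{d_{UC} + d_{UD}}{3 - 2} = 0.2 + 0.45 = 0.65
\end{aligned}
```

With three nodes left, all corrected distances are equal, which is expected: any pair can be joined and the resulting unrooted tree is the same.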
• To align additional sequences, the two already aligned sequences are converted to a
consensus sequence with fixed gap positions.
Progressive alignment:
Third step
• After realignment with a new sequence using dynamic programming, a new consensus is
derived.
• The process is repeated until all the sequences are aligned.
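Deriving a consensus from sequences that are already aligned can be sketched as below. This is a deliberately simplified illustration, assuming a plain column-wise majority vote; actual progressive aligners typically keep richer per-column information, and the function name and toy sequences are invented here.

```python
from collections import Counter

def consensus(aligned):
    """Column-wise majority residue; a column stays '-' only if every row has a gap.

    The output has the same length as the input rows, so column (and gap)
    positions from the existing alignment are preserved.
    """
    out = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        out.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(out)

# Three already-aligned toy sequences of equal length.
aligned = ["GAT-ACA", "GATTACA", "GCTTACA"]
print(consensus(aligned))
```

The consensus string would then stand in for the aligned group when the next sequence on the guide tree is aligned against it.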
Clustal Omega
https://www.ebi.ac.uk/Tools/msa/clustalo/
• Another feature of Clustal is the use of adjustable gap penalties that allow
more insertions and deletions in regions that are outside the conserved
domains, but fewer in conserved regions.
• In addition, gaps that are too close to one another can be penalized more
than gaps occurring in isolated loci.
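The idea of position-dependent gap penalties can be illustrated with a toy function. Everything here is hypothetical: the function name, the parameters, and the multipliers are invented for illustration and are not the formula Clustal uses.

```python
def gap_open_penalties(conserved, gap_columns, base=10.0,
                       conserved_factor=1.5, variable_factor=0.5,
                       window=8, crowding_factor=2.0):
    """Return one gap-opening penalty per alignment column (illustrative only).

    conserved   -- list of booleans, True where the column is conserved
    gap_columns -- column indices that already contain a gap
    """
    penalties = []
    for col, is_conserved in enumerate(conserved):
        # Gaps are more expensive inside conserved regions, cheaper outside.
        penalty = base * (conserved_factor if is_conserved else variable_factor)
        # A new gap close to an existing one is penalized more heavily.
        if any(col != g and abs(col - g) <= window for g in gap_columns):
            penalty *= crowding_factor
        penalties.append(penalty)
    return penalties

print(gap_open_penalties([True, False, False], gap_columns=[0], window=1))
```

A dynamic-programming aligner could look up the penalty for the current column in such a list instead of using a single fixed value.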
Drawbacks of progressive alignment