
Multiple Sequence Alignment Part 1

This document summarizes key concepts from a chapter about multiple sequence alignment (MSA) in bioinformatics. MSA aligns multiple related DNA or protein sequences to identify conserved patterns and motifs. It allows phylogenetic analysis to study evolutionary relationships and can help predict protein structure. The document discusses scoring functions, exhaustive and heuristic MSA algorithms like progressive alignment, and applications like PCR primer design.

Uploaded by

letsvansh

CS-434 Bioinformatics

Dr. Urooj Ainuddin


Multiple Sequence Alignment
Chapter 5
Introduction

• A natural extension of pairwise alignment is multiple sequence alignment (MSA), which aligns multiple related sequences to achieve optimal matching of the sequences.
• It arranges sequences in such a way that evolutionarily equivalent
positions across all sequences are matched.
Motivation to align multiple sequences

• MSA reveals much more biological information than many pairwise alignments can.
• It allows the identification of conserved sequence patterns and motifs in the
whole sequence family, which are hard to detect when comparing only two sequences.
• Many conserved and functionally critical amino acid residues can be identified in a protein
multiple alignment.
• MSA is an essential prerequisite to carrying out phylogenetic analysis of sequence families
and prediction of protein secondary and tertiary structures.
• Multiple sequence alignment also has applications in designing polymerase chain reaction
(PCR) primers based on multiple related sequences.
Phylogenetic analysis

• Phylogenetic analysis is the study of the evolutionary development of a species or a particular characteristic of an organism.
Phylogenetic analysis

• Branching diagrams are made to represent the relationship between different species, organisms, or characteristics of an organism (genes, proteins, organs, etc.) that have developed from a common ancestor.
• The diagram is known as a phylogenetic tree.
Protein structure

• The complete structure of a protein can be described at four different levels of complexity:
1. Primary,
2. Secondary,
3. Tertiary,
4. Quaternary.
Primary structure

• Primary structure is defined as the linear amino acid sequence of a protein's polypeptide chain.
• The term protein sequence is often used interchangeably with primary structure.
• For example, the primary structure of the human hemoglobin beta chain begins:
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY
PWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFR
Secondary structure

• Secondary structure is defined as the local spatial conformation of the polypeptide backbone excluding the side chains.
Secondary structure

• Secondary structure elements common to many proteins include α-helices and β-sheets.
Tertiary structure

• Tertiary structure refers to the 3D arrangement of all the atoms that constitute a protein molecule.
• It relates to the precise spatial coordination of secondary structure elements and the location of all functional groups of a single polypeptide chain.
Quaternary structure

• Quaternary structure is the 3D structure consisting of the aggregation of two or more individual polypeptide chains (subunits) that operate as a single functional unit (multimer).
Polymerase Chain Reaction (PCR)

• Polymerase chain reaction is a method widely used to rapidly make millions to billions of copies (complete or partial) of a specific DNA sample, allowing scientists to take a very small sample of DNA and amplify it (or a part of it) to a large enough amount to study in detail.
• Short DNA sequences called primers are used to select the portion of the genome
to be amplified.
• The temperature of the sample is repeatedly raised and lowered to help a DNA
replication enzyme copy the target DNA sequence.
• Billions of copies of the target sequence are created in just a few hours.
Scoring Function

• The goal of MSA is to arrange sequences in such a way that a maximum number of residues from each sequence are matched up according to a particular scoring function.
• The scoring function for multiple sequence alignment is based
on the concept of sum of pairs (SP).
• It is the sum of the scores of all possible pairs of sequences in a
multiple alignment based on a particular scoring matrix.
Scoring Function
• In calculating the SP scores, each column is scored by summing
the scores for all possible pairwise matches, mismatches and gap
costs.
• The score of the entire alignment is the sum of all the column
scores.
• The purpose of most multiple sequence alignment algorithms is
to achieve maximum SP scores.
Scoring multiple nucleotide sequences

• We have three sequences: ATT, AT and ACAT.
• We produce the following MSA, where match=2, mismatch=-1, gap penalty=-2 (a gap paired with a gap scores 0):
• Sequence 1 A_TT
• Sequence 2 A_T_
• Sequence 3 ACAT
• The score of the first column is 2+2+2=6, that of the second column is 0-2-2=-4, that of the third column is 2-1-1=0, and that of the fourth column is -2-2+2=-2.
• The SP score of this MSA is 6-4+0-2=0.
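The column-by-column arithmetic above can be sketched in a few lines of Python. This is only an illustration under the slide's scoring scheme (match=2, mismatch=-1, gap=-2, gap against gap scores 0); the function names `pair_score` and `column_scores` are hypothetical and not part of any library.

```python
from itertools import combinations

def pair_score(a, b, match=2, mismatch=-1, gap=-2, gap_char="_"):
    """Score one pair of symbols taken from the same alignment column."""
    if a == gap_char and b == gap_char:
        return 0  # gap against gap scores 0, as in the worked example
    if a == gap_char or b == gap_char:
        return gap
    return match if a == b else mismatch

def column_scores(alignment):
    """Return the sum-of-pairs score of every column of the alignment."""
    return [
        sum(pair_score(a, b) for a, b in combinations(column, 2))
        for column in zip(*alignment)
    ]

msa = ["A_TT", "A_T_", "ACAT"]
scores = column_scores(msa)
print(scores, sum(scores))  # prints [6, -4, 0, -2] 0
```

The SP score of the whole alignment is simply the sum of the per-column scores.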
Scoring multiple amino acid sequences

• We have three sequences: GKN, TRN and SHE.
• We produce the following MSA, using BLOSUM62, gap penalty=-8:
• Sequence 1 GKN
• Sequence 2 TRN
• Sequence 3 SHE
• The score of the first column is GT+TS+GS=-2+1+0=-1, that of the second column is KR+RH+KH=2+0-1=1, and that of the third column is NN+NE+NE=6+0+0=6.
• The SP score of this MSA is -1+1+6=6.
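The same calculation can be sketched with a substitution-matrix lookup. The dictionary below holds only the BLOSUM62 entries quoted in the example, so it is not a usable matrix in general; `BLOSUM62_SUBSET`, `pair_score` and `column_scores` are hypothetical names, and scoring a gap against a gap as 0 is an assumption carried over from the nucleotide example.

```python
from itertools import combinations

# Only the BLOSUM62 entries used in the worked example; a real program
# would load the complete matrix instead.
BLOSUM62_SUBSET = {
    ("G", "T"): -2, ("T", "S"): 1, ("G", "S"): 0,
    ("K", "R"): 2, ("R", "H"): 0, ("K", "H"): -1,
    ("N", "N"): 6, ("N", "E"): 0,
}

def pair_score(a, b, gap=-8, gap_char="-"):
    """Look up a residue pair, trying both orders since the matrix is symmetric."""
    if a == gap_char and b == gap_char:
        return 0  # assumption: gap against gap costs nothing
    if a == gap_char or b == gap_char:
        return gap
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def column_scores(alignment):
    """Sum-of-pairs score of each column."""
    return [
        sum(pair_score(a, b) for a, b in combinations(column, 2))
        for column in zip(*alignment)
    ]

msa = ["GKN", "TRN", "SHE"]
scores = column_scores(msa)
print(scores, sum(scores))  # prints [-1, 1, 6] 6
```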
Exhaustive algorithms
• The exhaustive alignment method involves examining all possible aligned positions simultaneously.
• For aligning N sequences, an N-dimensional matrix needs to be filled with alignment scores.
• Back-tracking is applied through the N-dimensional matrix to find the highest scored path that represents the optimal alignment.
Limitations of exhaustive algorithms
• Because the computational time and memory space required increase exponentially with the number of sequences, the method is computationally prohibitive for large data sets.
• Full dynamic programming is limited to small datasets of fewer than ten short sequences.
• Few multiple alignment programs employing DP are
publicly available.
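To get a feel for the exponential growth, the sketch below counts the cells of the N-dimensional dynamic programming matrix. It assumes the usual layout with one extra row or column per sequence for the empty prefix, and that each cell is computed from up to 2^N - 1 neighboring cells; the function names are hypothetical.

```python
def dp_cells(lengths):
    """Number of cells in the DP matrix for sequences of the given lengths."""
    cells = 1
    for length in lengths:
        cells *= length + 1  # one extra index per dimension for the empty prefix
    return cells

def predecessors_per_cell(num_sequences):
    """Each cell can be reached from up to 2^N - 1 neighboring cells."""
    return 2 ** num_sequences - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells([100] * n), predecessors_per_cell(n))
```

With ten sequences of length 100 the matrix already has about 10^20 cells, which is why exhaustive alignment is restricted to a handful of short sequences.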
Divide and Conquer Alignment (DCA)
https://bibiserv.cebitec.uni-bielefeld.de/dca

• DCA is a web-based program that is in fact semi-exhaustive because certain steps of computation are reduced to heuristics.
• It works by breaking each of the sequences into two smaller sections. If the
sections are not short enough, further divisions are carried out.
• When the lengths of the sequences reach a predefined threshold, dynamic
programming is applied for aligning each set of subsequences.
Divide and Conquer Alignment (DCA)
https://bibiserv.cebitec.uni-bielefeld.de/dca

• The resulting short alignments are joined together head to tail to yield a
multiple alignment of the entire length of all sequences.
• It performs global alignment and requires the input sequences to be of
similar lengths and domain structures.
• Despite the use of heuristics, the program is still extremely computationally
intensive and can handle only a very limited number of sequences.
Heuristic algorithms
• The heuristic algorithms for MSA fall into three categories:
1. Progressive alignment,
2. Iterative alignment, and
3. Block-based alignment.
Progressive alignment

• Progressive alignment depends on the stepwise assembly of multiple alignment and is heuristic in nature.
• It speeds up the alignment of multiple sequences through a multistep
process.
Progressive alignment: First step

• The first step conducts pairwise alignments for each possible pair of sequences using the Needleman–Wunsch method and records similarity scores (based on a particular substitution matrix) from the pairwise comparisons.
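A minimal sketch of the Needleman–Wunsch score computation is shown below. It uses a simple match/mismatch scheme with a linear gap penalty (the values 2, -1 and -2 are borrowed from the nucleotide scoring example) rather than a substitution matrix, and it returns only the optimal score, not the alignment itself. The function name is hypothetical.

```python
def needleman_wunsch_score(s, t, match=2, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t by dynamic programming."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap inserted in t
            left = score[i][j - 1] + gap    # gap inserted in s
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]

sequences = {"Sequence 1": "ATT", "Sequence 2": "AT", "Sequence 3": "ACAT"}
names = list(sequences)
for a in range(len(names)):
    for b in range(a + 1, len(names)):
        print(names[a], names[b],
              needleman_wunsch_score(sequences[names[a]], sequences[names[b]]))
```

Running the function on every pair, as in the loop above, yields the table of similarity scores that the next stage converts into distances.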
Progressive alignment: First step

• The scores are converted into evolutionary distances to generate a distance matrix.
• The greater the similarity, the smaller the distance.
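The slides do not state how similarity is converted into distance. One simple possibility, shown here purely as an assumption, is to take the distance as one minus the fraction of identical positions in the pairwise alignment, skipping columns that contain a gap. The function name `p_distance` is hypothetical.

```python
def p_distance(aligned_a, aligned_b, gap_char="-"):
    """Distance = 1 - fraction of identical residues over gap-free columns."""
    pairs = [
        (x, y) for x, y in zip(aligned_a, aligned_b)
        if x != gap_char and y != gap_char
    ]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    identical = sum(1 for x, y in pairs if x == y)
    return 1 - identical / len(pairs)

print(p_distance("ACGT", "ACGA"))  # prints 0.25
print(p_distance("ACGT", "ACGT"))  # prints 0.0
```

Identical sequences get distance 0 and the distance grows as similarity drops, which matches the rule that greater similarity means smaller distance.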
Progressive alignment: Second step

• A guide tree is generated from the distance matrix.
• The tree reflects evolutionary proximity among all the sequences.
Construction of the guide tree:
UPGMA method
• Unweighted Pair Group Method using Arithmetic average (UPGMA) is the
simplest clustering method.
• The basic assumption of the UPGMA method is that all taxa evolve at a
constant rate and that they are equally distant from the root.
• However, real data rarely meet this assumption.
• Thus, UPGMA often produces erroneous tree topologies.
• However, it is a fast method.
Construction of the guide tree: UPGMA method
    A     B     C     D
A   -
B   0.4   -
C   0.35  0.45  -
D   0.6   0.7   0.55  -

1. Join the two closest nodes. These are A and C, with distance 0.35.
2. This new node is U.
3. The branch length from U to A or from U to C is 0.35/2 = 0.175.
[Figure: node U joins A and C, each with branch length 0.175.]
Construction of the guide tree: UPGMA method
1. BU = (BA + BC) / 2 = (0.4 + 0.45) / 2 = 0.425.
2. DU = (DA + DC) / 2 = (0.6 + 0.55) / 2 = 0.575.
    U      B    D
U   -
B   0.425  -
D   0.575  0.7  -
3. Join the two closest nodes. These are B and U, with distance 0.425.
4. This new node is V.
5. The branch length from V to B is 0.425/2 ≈ 0.212.
[Figure: node V joins U and B (branch to B 0.212); U joins A and C (branches 0.175 each).]
Construction of the guide tree: UPGMA method
1. DV = (DA + DB + DC) / 3 = (0.6 + 0.7 + 0.55) / 3 = 0.617.
    V      D
V   -
D   0.617  -
2. Join D and V.
3. This new node is W.
4. The branch length from D to W is 0.617/2 ≈ 0.309.
5. The guide tree is now complete.
[Figure: completed UPGMA guide tree. W joins V and D (branch to D 0.309); V joins U and B (branch to B 0.212); U joins A and C (branches 0.175 each).]
Construction of the guide tree: UPGMA method
• This guide tree will be used in progressive alignment.
• Observe the distance matrix represented by the guide tree (each entry is twice the rounded branch length). You can see that it is different from the original matrix.
    A      B      C      D
A   -
B   0.424  -
C   0.35   0.424  -
D   0.618  0.618  0.618  -
[Figure: UPGMA guide tree with branch lengths 0.175 (A, C), 0.212 (B) and 0.309 (D).]
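The UPGMA steps worked through above can be sketched as a short Python function. This is an illustrative implementation, not the code of any particular alignment program: it repeatedly joins the closest pair, places the new node at half the joining distance, and averages distances weighted by cluster size. The names `upgma` and `slide_matrix` are hypothetical.

```python
def upgma(dist):
    """Return the list of joins as (cluster_a, cluster_b, node_height)."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {name: 1 for pair in d for name in pair}  # leaves per cluster
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2              # branch length from new node to its leaves
        new = f"({a},{b})"
        new_size = size[a] + size[b]
        for other in size:
            if other in pair:
                continue
            # Size-weighted mean equals the average over all original leaf pairs.
            d[frozenset((new, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / new_size
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        del size[a], size[b]
        size[new] = new_size
        merges.append((a, b, height))
    return merges

slide_matrix = {
    ("A", "B"): 0.4, ("A", "C"): 0.35, ("B", "C"): 0.45,
    ("A", "D"): 0.6, ("B", "D"): 0.7, ("C", "D"): 0.55,
}
for a, b, height in upgma(slide_matrix):
    print(f"join {a} and {b} at height {height:.4f}")
```

On the example matrix the joins occur at heights 0.175, 0.2125 and about 0.3083, which round to the branch lengths 0.175, 0.212 and 0.309 in the slides.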
Construction of the guide tree:
Neighbor Joining method
• The Neighbor Joining (NJ) method can be used to build a guide tree.
• The NJ method does not assume the taxa to be equidistant from the root.
• It corrects for unequal evolutionary rates between sequences using a conversion step, which calculates “r-values” and “transformed r-values” using the following formulae, $n$ being the number of nodes:

$$r_i = \sum_{j \neq i} d_{ij}$$

$$r_i' = \frac{r_i}{n - 2}$$
Construction of the guide tree:
Neighbor Joining method
• Once the “r-values” and “transformed r-values” are in hand, and $d_{ij}$ is the evolutionary distance between $i$ and $j$, the converted distance between $i$ and $j$ is given as:

$$d_{ij}' = d_{ij} - \frac{r_i + r_j}{n - 2} = d_{ij} - r_i' - r_j'$$
• Before tree construction, all possible nodes are collapsed into a star tree.
• The pair of taxa with the shortest corrected distance in the new matrix is separated from the star tree first.
Construction of the guide tree:
Neighbor Joining method
while n > 2 {
    Calculate r_i and r_i'.
    Calculate d_ij'.
    Create the corrected distance matrix.
    Join the two nodes which are the closest, S and T. Let the new joined node be X.
    Calculate distances from X to S and T.
    Calculate distances from X to the remaining nodes and create a reduced matrix using the previous matrix (n decreases by 1).
}
if n == 2 {
    Join the two nodes and calculate the distance between them.
}
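The loop above can be turned into a runnable Python sketch. The slides do not print the formulas for the distances from the new node X, so the standard neighbor-joining update rules are assumed here: d(S,X) = (d(S,T) + r'(S) - r'(T)) / 2 and d(X,K) = (d(S,K) + d(T,K) - d(S,T)) / 2. The names `neighbor_joining`, `slide_matrix` and the internal node labels X0, X1, ... are hypothetical.

```python
from itertools import combinations

def neighbor_joining(dist):
    """Return the tree as a list of branches (node, neighbor, length)."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = sorted({name for pair in d for name in pair})
    branches = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Transformed r-values: r_i' = (sum of distances from i) / (n - 2).
        r_prime = {
            i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
            for i in nodes
        }

        def corrected(pair):
            i, j = pair
            return d[frozenset(pair)] - r_prime[i] - r_prime[j]

        # Join the pair with the smallest corrected distance (ties: first found).
        s, t = min(combinations(nodes, 2), key=corrected)
        x = f"X{new_id}"
        new_id += 1
        d_st = d[frozenset((s, t))]
        d_sx = (d_st + r_prime[s] - r_prime[t]) / 2   # assumed standard formula
        branches.append((s, x, d_sx))
        branches.append((t, x, d_st - d_sx))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (s, t):
                d[frozenset((x, k))] = (
                    d[frozenset((s, k))] + d[frozenset((t, k))] - d_st
                ) / 2
        nodes = [k for k in nodes if k not in (s, t)] + [x]
    a, b = nodes
    branches.append((a, b, d[frozenset((a, b))]))  # final join when n == 2
    return branches

slide_matrix = {
    ("A", "B"): 0.4, ("A", "C"): 0.35, ("B", "C"): 0.45,
    ("A", "D"): 0.6, ("B", "D"): 0.7, ("C", "D"): 0.55,
}
for node, neighbor, length in neighbor_joining(slide_matrix):
    print(f"{node} - {neighbor}: {length:.3f}")
```

When several pairs tie for the smallest corrected distance, this sketch simply takes the first one it encounters, so the order of joins may differ from a hand calculation even though the resulting unrooted tree is the same.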
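Construction of the guide tree: NJ method
    A     B     C     D
A   -
B   0.4   -
C   0.35  0.45  -
D   0.6   0.7   0.55  -

• Calculate $r_i$ and $r_i'$. Here $n = 4$, so $r_i' = r_i / 2$:
• $r_A = 0.4 + 0.35 + 0.6 = 1.35$, $r_A' = 0.675$
• $r_B = 0.4 + 0.45 + 0.7 = 1.55$, $r_B' = 0.775$
• $r_C = 0.35 + 0.45 + 0.55 = 1.35$, $r_C' = 0.675$
• $r_D = 0.6 + 0.7 + 0.55 = 1.85$, $r_D' = 0.925$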


Construction of the guide tree: NJ method

• Calculate $d_{ij}' = d_{ij} - \frac{r_i + r_j}{n - 2} = d_{ij} - r_i' - r_j'$.
• $d_{AB}' = d_{AB} - r_A' - r_B' = 0.4 - 0.675 - 0.775 = -1.05$
• $d_{AC}' = d_{AC} - r_A' - r_C' = 0.35 - 0.675 - 0.675 = -1$
• $d_{AD}' = d_{AD} - r_A' - r_D' = 0.6 - 0.675 - 0.925 = -1$
• $d_{BC}' = d_{BC} - r_B' - r_C' = 0.45 - 0.775 - 0.675 = -1$
• $d_{BD}' = d_{BD} - r_B' - r_D' = 0.7 - 0.775 - 0.925 = -1$
• $d_{CD}' = d_{CD} - r_C' - r_D' = 0.55 - 0.675 - 0.925 = -1.05$
Construction of the guide tree: NJ method

• Create the corrected distance matrix from the $d_{ij}'$ values.
    A      B      C      D
A   -
B   -1.05  -
C   -1     -1     -
D   -1     -1     -1.05  -
Construction of the guide tree: NJ method

• Join the two nodes which are the closest. Two pairs tie at the smallest corrected distance of -1.05: A and B, and C and D.
• We join A and B into a new node called U.
• Calculate distances from U to A and B.
[Figure: node U joins A and B.]
Construction of the guide tree: NJ method

• The new cluster allows the construction of a reduced matrix.
• This starts with distances from the initial matrix.
• Calculate distances from U to the remaining nodes.
    U     C     D
U   -
C   0.2   -
D   0.45  0.55  -
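The slides ask for the distances from U to A and B and from U to the remaining nodes without printing the formulas. Assuming the standard neighbor-joining update rules (an assumption, since the source does not state them), the numbers work out as follows and reproduce the reduced matrix:

```latex
\begin{aligned}
d_{AU} &= \tfrac{1}{2}\left(d_{AB} + r_A' - r_B'\right) = \tfrac{1}{2}(0.4 + 0.675 - 0.775) = 0.15 \\
d_{BU} &= d_{AB} - d_{AU} = 0.4 - 0.15 = 0.25 \\
d_{UC} &= \tfrac{1}{2}\left(d_{AC} + d_{BC} - d_{AB}\right) = \tfrac{1}{2}(0.35 + 0.45 - 0.4) = 0.2 \\
d_{UD} &= \tfrac{1}{2}\left(d_{AD} + d_{BD} - d_{AB}\right) = \tfrac{1}{2}(0.6 + 0.7 - 0.4) = 0.45
\end{aligned}
```

The distance between C and D is unaffected by the join and stays at 0.55.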
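Construction of the guide tree: NJ method

• Calculate $r_i$ and $r_i'$. Here $n = 3$, so $r_i' = r_i / 1 = r_i$:
• $r_U = 0.2 + 0.45 = 0.65$, $r_U' = 0.65$
• $r_C = 0.2 + 0.55 = 0.75$, $r_C' = 0.75$
• $r_D = 0.45 + 0.55 = 1$, $r_D' = 1$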


Construction of the guide tree: NJ method

• Calculate $d_{ij}' = d_{ij} - \frac{r_i + r_j}{n - 2} = d_{ij} - r_i' - r_j'$.
• $d_{CU}' = d_{CU} - r_C' - r_U' = 0.2 - 0.75 - 0.65 = -1.2$
• $d_{DU}' = d_{DU} - r_D' - r_U' = 0.45 - 1 - 0.65 = -1.2$
• $d_{CD}' = d_{CD} - r_C' - r_D' = 0.55 - 0.75 - 1 = -1.2$
Construction of the guide tree: NJ method

• Create the corrected distance matrix from the $d_{ij}'$ values.
    U     C     D
U   -
C   -1.2  -
D   -1.2  -1.2  -
Construction of the guide tree: NJ method

• Join the two nodes which are the closest. All pairs have the same corrected distance. We pick the pair, C and U.
• We join C and U into a new node called V.
• Calculate distances from V to C and U.
[Figure: node V joins U and C; U joins A and B.]
Construction of the guide tree: NJ method

• The new cluster allows the construction of a reduced matrix.
• This starts with distances from the previous matrix.
• Calculate the distance from V to the remaining node, D.
    V    D
V   -
D   0.4  -
Construction of the guide tree: NJ method

• Because D is the last branch to be decomposed from the star tree, we do not calculate $r_i$ and $r_i'$: with $n = 2$, $r_i' = r_i/(n - 2)$ would require division by zero and is undefined.
• Without $r_i'$, we cannot calculate $d_{ij}'$.
• Join the two nodes, D and V.
• The guide tree is now complete.
[Figure: completed NJ guide tree. U joins A and B; V joins U and C; D joins V.]


Construction of the guide tree: NJ method

• This guide tree will be used in progressive alignment.
• Observe the distance matrix represented by the guide tree. You can see that it is the same as the original matrix.
    A     B     C     D
A   -
B   0.4   -
C   0.35  0.45  -
D   0.6   0.7   0.55  -
[Figure: NJ guide tree. U joins A and B; V joins U and C; D joins V.]
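The claim can be checked by adding branch lengths along each path in the tree. The branch lengths below are not printed in the slides; they are derived here under the assumption of the standard neighbor-joining formula $d_{SX} = \tfrac{1}{2}(d_{ST} + r_S' - r_T')$, which gives $d_{AU} = 0.15$, $d_{BU} = 0.25$, $d_{CV} = 0.15$, $d_{UV} = 0.05$ and $d_{DV} = 0.4$.

```latex
\begin{aligned}
d_{AB} &= d_{AU} + d_{BU} = 0.15 + 0.25 = 0.4 \\
d_{AC} &= d_{AU} + d_{UV} + d_{CV} = 0.15 + 0.05 + 0.15 = 0.35 \\
d_{BC} &= d_{BU} + d_{UV} + d_{CV} = 0.25 + 0.05 + 0.15 = 0.45 \\
d_{AD} &= d_{AU} + d_{UV} + d_{DV} = 0.15 + 0.05 + 0.4 = 0.6 \\
d_{BD} &= d_{BU} + d_{UV} + d_{DV} = 0.25 + 0.05 + 0.4 = 0.7 \\
d_{CD} &= d_{CV} + d_{DV} = 0.15 + 0.4 = 0.55
\end{aligned}
```

Every path length equals the corresponding entry of the original matrix, whereas the UPGMA tree forces all leaves to be equally distant from the root and so cannot reproduce it.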
Progressive alignment: Third step

• According to the guide tree, the two most closely related sequences are first realigned using the Needleman–Wunsch algorithm.
Progressive alignment: Third step

• To align additional sequences, the two already aligned sequences are converted to a consensus sequence with fixed gap positions.
Progressive alignment: Third step

• The consensus is then treated as a single sequence for the next alignment.
• More distant sequences are added in accordance with their relative positions on the guide tree.
Progressive alignment: Third step

• After realignment with a new sequence using dynamic programming, a new consensus is derived.
• The process is repeated until all the sequences are aligned.
Clustal Omega
https://www.ebi.ac.uk/Tools/msa/clustalo/

• Probably the most well-known progressive alignment program is Clustal.
• Clustal is a progressive multiple alignment program available either as a standalone or online program.
• Clustal Omega supports the widest variety of operating systems out of all the Clustal tools.
Clustal Omega
https://www.ebi.ac.uk/Tools/msa/clustalo/
• Clustal does not rely on a single substitution matrix.
• Instead, it applies different scoring matrices when aligning sequences, depending
on degrees of similarity.
• The choice of a matrix depends on the evolutionary distances measured from the
guide tree.
• For example, for closely related sequences that are aligned in the initial steps,
Clustal automatically uses the BLOSUM62 or PAM120 matrix. When more
divergent sequences are aligned in later steps of the progressive alignment, the
BLOSUM45 or PAM250 matrices may be used instead.
Clustal Omega
https://www.ebi.ac.uk/Tools/msa/clustalo/

• Another feature of Clustal is the use of adjustable gap penalties that allow
more insertions and deletions in regions that are outside the conserved
domains, but fewer in conserved regions.
• In addition, gaps that are too close to one another can be penalized more
than gaps occurring in isolated loci.
Drawbacks of progressive alignment

• The progressive alignment method is not suitable for comparing sequences of different lengths because it is based on global alignment.
• Because affine gap penalties are used, long gaps are heavily penalized and effectively disallowed, and, in some cases, this may limit the accuracy of the method. (An affine gap penalty is calculated as A+BL, where A is the gap opening penalty, B is the gap extension penalty and L is the length of the gap.)
• The final alignment result is influenced by the order of sequence addition.
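The affine gap penalty formula A+BL mentioned above can be illustrated with a short sketch. The parameter values (opening penalty 10, extension penalty 1) are arbitrary examples, not Clustal defaults, and the function name is hypothetical.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty A + B*L for one gap of the given length (0 if there is no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

# One gap of length 4 versus four separate gaps of length 1.
one_long_gap = affine_gap_penalty(4, gap_open=10, gap_extend=1)
four_short_gaps = 4 * affine_gap_penalty(1, gap_open=10, gap_extend=1)
print(one_long_gap, four_short_gaps)  # prints 14 44
```

The opening penalty is paid once per gap, so a single long gap costs less than the same number of gap positions scattered across the alignment, yet its cost still grows with every extra position.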
Drawbacks of progressive alignment

• A major limitation is the dependence of the algorithm on pairwise alignment.


• Once gaps are introduced in the early steps of alignment, they are fixed.
• Any errors made in these steps cannot be corrected and can propagate
throughout the entire alignment.
• The final alignment could be far from optimal.
• The problem can be more glaring when dealing with divergent sequences.
Homework

• For the given distance matrix, form guide trees using both clustering algorithms discussed in this material.
    A   B   C   D   E
A   -
B   7   -
C   15  9   -
D   11  7   12  -
E   16  8   7   11  -
