Multiple Sequence Alignment
Homology: Definition
 Homology: similarity that is the result of inheritance from a common ancestor.
   Paralogs - related genes within an organism
   Orthologs - related genes in other species
 An alignment is a hypothesis of positional homology between bases or amino acids.
Why is multiple sequence alignment used?
 Related proteins can often suggest the likely function, structure, and evolution of a sequence.
 Multiple alignment is more sensitive than pairwise alignment for detecting homologs.
 It reveals conserved residues or motifs.
 Database searches effectively perform a multiple sequence alignment.
 The regulatory regions of many genes contain consensus sequences for transcription factor-binding sites.
Information in Multiple Alignment
 Conserved regions
   Regions that are invariant across the whole alignment.
   These usually indicate regions with a specific function.
   Can be totally or partially conserved.
 Phylogenetic analysis
   Tells you which sequences are most closely related.
   Sequences are arranged from the most closely related to the most distantly related.
Multiple Sequence Alignment -- Goal
 To generate a concise, information-rich summary of sequence data.
 Used to illustrate the similarity between a group of sequences.
 Used to illustrate the dissimilarity between a group of sequences.
 Alignments can be treated as models that can be used to test hypotheses.
Alignment can be easy or difficult
Easy
Difficult: due to insertions or deletions
The Methods of Multiple Sequence Alignment
Multiple Sequence Alignment - methods
 Methods of solving the multiple alignment problem
   Manual
   Dynamic Programming
   Hidden Markov Models (HMMs)
   Progressive Alignment
Manual Alignment
 Manual alignment is a reasonable choice when:
   the alignment is easy;
   there is some extraneous information to draw on;
   automated alignment methods have run into the local minimum problem;
   an automated alignment can be improved by hand.
Dynamic Programming Alignment
 Consider 2 protein sequences of 100 amino acids in length.
 If it takes $100^2$ seconds to completely align these 2 sequences, it will take $100^3$ seconds to align 3 sequences, $100^4$ seconds to align 4 sequences, etc.
 Aligning 20 sequences completely would take $100^{20} = 10^{40}$ seconds, which is on the order of $10^{32}$ years.
 Limited to a small number of sequences.
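The growth described above can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the slide does, that fully aligning N sequences of length L costs L^N time units of one second each. The function name `full_dp_seconds` is made up for illustration.

```python
# Sketch: time for full dynamic programming, assuming cost = L**N seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7


def full_dp_seconds(seq_len: int, n_seqs: int) -> int:
    """Return the assumed cost (in seconds) of aligning n_seqs sequences of length seq_len."""
    return seq_len ** n_seqs


for n in (2, 3, 4, 20):
    seconds = full_dp_seconds(100, n)
    years = seconds / SECONDS_PER_YEAR
    print(f"{n:2d} sequences: {seconds:.3e} s = {years:.3e} years")
```

Each extra sequence multiplies the cost by the sequence length, which is why the method is limited to a handful of sequences.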
Pairwise Alignment
Aligning two sequences: GATTC & GAATTC
Scoring: match +1, mismatch 0, indel -1

GATTC-
GAATTC
Score = 2

GA-TTC
GAATTC
Score = 4
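The scoring scheme above (match +1, mismatch 0, indel -1) can be sketched in code. The first function scores a given gapped alignment; the second finds the best global score by dynamic programming (a Needleman-Wunsch style recurrence). The function names and the gap placements in the example strings are illustrative assumptions, not taken from the slides.

```python
# Sketch: score an alignment and find the optimal global score for two sequences.

def score_alignment(a: str, b: str, match: int = 1, mismatch: int = 0, indel: int = -1) -> int:
    """Score two equal-length gapped strings column by column ('-' is a gap)."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def best_global_score(s: str, t: str, match: int = 1, mismatch: int = 0, indel: int = -1) -> int:
    """Fill the dynamic programming table row by row and return the optimal score."""
    prev = [j * indel for j in range(len(t) + 1)]  # row 0: only gaps so far
    for i in range(1, len(s) + 1):
        curr = [i * indel]  # column 0: only gaps so far
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


print(score_alignment("GATTC-", "GAATTC"))   # 2
print(score_alignment("GA-TTC", "GAATTC"))   # 4
print(best_global_score("GATTC", "GAATTC"))  # 4
```

The table has (len(s) + 1) x (len(t) + 1) cells, which is the quadratic cost that grows to L^N for N sequences.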
Hidden Markov Models
 HMMER was written by Sean Eddy. http://hmmer.wustl.edu
 Runs on UNIX platforms.
 Probabilistic models that describe the likelihood that an amino acid residue occurs at each given position of an alignment.
 Two main uses
   search a sequence database with a single profile HMM.
   search a single query sequence against a library of HMMs.
Progressive Alignment
 Devised by Feng and Doolittle in 1987.
 A heuristic method and, as such, not guaranteed to find the optimal alignment.
 Based on pairwise alignment.
 The most successful implementation is Clustal (by Des Higgins).
ClustalW
ClustalW - Introduction
. Its general purpose is the comparison or alignment of DNA or protein sequences.
. Biologists can study sequence patterns conserved through evolution and the ancestral relationships between different organisms.
. ClustalW runs on different operating systems, including WinXP, UNIX (Linux), and Macintosh.
. History: the first Clustal programme (1988) by Des Higgins; ClustalV (1992); ClustalW (1994); ClustalX
. The latest version is ClustalW 1.83
ClustalW - download & WWW
Download
 http://www.imtech.res.in/pub/mirror_sites/ebi/dos/clustalw/
 http://iubio.bio.indiana.edu/soft/iubionew/molbio/dna/analysis/ClustalW/
 ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
WWW
 http://www.ebi.ac.uk/clustalw (version 1.83)
Three main stages for ClustalW:
1. Pairwise alignment: calculate the distance matrix
2. Unrooted Neighbour-Joining tree, then rooted NJ tree (guide tree) and sequence weights
3. Progressive alignment: align following the guide tree
Pairwise Alignment
. Each sequence is pairwise aligned with every other sequence.
   For example, for n sequences, ${}_{n}C_{2} = \frac{n(n-1)}{2}$ pairwise alignments are calculated.
. Accurate scores come from full dynamic programming alignment
   using 2 gap penalties (opening and extending)
   and a full amino acid weight matrix.
. Each pairwise alignment is completely independent
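Since every pair is aligned once and independently, the list of jobs is simply all unordered pairs. A minimal sketch follows, with made-up sequence names; `pairwise_jobs` is an illustrative helper, not part of ClustalW.

```python
# Sketch: enumerate the n(n-1)/2 independent pairwise alignment jobs.
from itertools import combinations


def pairwise_jobs(names):
    """Return every unordered pair of sequence names, one pair per alignment."""
    return list(combinations(names, 2))


names = ["seqA", "seqB", "seqC", "seqD"]  # hypothetical sequence names
jobs = pairwise_jobs(names)
n = len(names)
print(len(jobs), n * (n - 1) // 2)  # 6 6
print(jobs[0])                      # ('seqA', 'seqB')
```

Because no job depends on another, the pairs could be computed in any order or in parallel.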
Calculate distance matrix
The pairwise alignment scores are initially calculated as per cent identity scores. They are converted to distances by dividing by 100 and subtracting from 1.0, which gives the number of differences per site: distance = 1.0 - (per cent identity / 100).
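The conversion from per cent identity to distance can be sketched as follows. This is an illustration under one assumption: identity is counted over aligned columns where neither sequence has a gap. The function names are made up, and an alignment with no ungapped columns would need separate handling.

```python
# Sketch: per cent identity of a pairwise alignment, converted to a distance.

def percent_identity(a: str, b: str) -> float:
    """Identical columns as a percentage of compared columns (gap columns skipped)."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    identical = sum(1 for x, y in compared if x == y)
    return 100.0 * identical / len(compared)


def distance(a: str, b: str) -> float:
    """Differences per site: 1.0 - (per cent identity / 100)."""
    return 1.0 - percent_identity(a, b) / 100.0


print(percent_identity("GA-TTC", "GAATTC"))  # 100.0
print(distance("GA-TTC", "GAATTC"))          # 0.0
print(percent_identity("GATTC-", "GAATTC"))  # 60.0
```

Filling a table with `distance` for every pair of sequences gives the distance matrix used to build the guide tree.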
Guide Tree - unrooted NJ tree
 Generate a Neighbor-Joining guide tree from these pairwise distances.
 This guide tree gives the order in which the progressive alignment will be carried out.
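As a rough illustration of how a Neighbor-Joining tree is built from pairwise distances, the sketch below implements the standard NJ recurrence (join the pair with the smallest Q value, then replace it with a new internal node). It is not ClustalW's own code; the five-taxon distance table is a common textbook example, and all names are illustrative.

```python
# Sketch: textbook neighbour-joining on a small distance table.

def neighbor_joining(names, dist):
    """Return (joins, last_edge); each join is (node_a, node_b, len_a, len_b)."""
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a},{b})"
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b, len_a, len_b))
    last_edge = (nodes[0], nodes[1], d[nodes[0]][nodes[1]])
    return joins, last_edge


names = ["a", "b", "c", "d", "e"]
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
         ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = value

joins, last_edge = neighbor_joining(names, dist)
for join in joins:
    print(join)
print(last_edge)
```

The result is an unrooted tree; ClustalW then roots it to obtain the guide tree and the sequence weights.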
Guide Tree - rooted NJ tree
The sequence weights depend on the distance from the root of the tree, but sequences that have a branch in common with other sequences share the weight derived from that shared branch.
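One way to read the weighting rule is: walk from the root to each sequence, and for every branch on the way add its length divided by the number of sequences below it. The sketch below implements that reading on a small made-up tree. The tree encoding and branch lengths are assumptions for illustration, and no normalisation of the weights is attempted.

```python
# Sketch: sequence weights from a rooted tree with shared branches split evenly.
# A node is (payload, branch_length); payload is a leaf name or a list of child nodes.

def count_leaves(node) -> int:
    payload, _ = node
    if isinstance(payload, str):
        return 1
    return sum(count_leaves(child) for child in payload)


def sequence_weights(node, inherited: float = 0.0) -> dict:
    """Accumulate branch length / number of leaves sharing the branch, root to leaf."""
    payload, length = node
    share = inherited + length / count_leaves(node)
    if isinstance(payload, str):
        return {payload: share}
    weights = {}
    for child in payload:
        weights.update(sequence_weights(child, share))
    return weights


# Hypothetical rooted tree: A hangs off the root; B and C share a branch of length 0.2.
tree = ([("A", 0.3), ([("B", 0.1), ("C", 0.1)], 0.2)], 0.0)
print(sequence_weights(tree))
```

Here A keeps its whole branch, while B and C each receive half of the branch they share plus their own branch, so closely related sequences are down-weighted relative to A.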
Progressive Alignment
 Align the two most closely related sequences first.
 This alignment is then fixed and will never change.
 Once a gap, always a gap.
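The "once a gap, always a gap" rule can be illustrated with a small helper. When a later step needs a new gap in an already aligned group, it is added as a whole column to every sequence in the group, and the existing gaps stay where they are. This is only a sketch of that bookkeeping, not of the full progressive algorithm; `add_gap_columns` is a made-up name.

```python
# Sketch: a fixed sub-alignment only ever gains gap columns; it is never re-aligned.

def add_gap_columns(aligned_group, positions):
    """Insert an all-gap column before each given column index in every sequence."""
    result = []
    for seq in aligned_group:
        chars = list(seq)
        for pos in sorted(positions, reverse=True):  # right to left keeps indices valid
            chars.insert(pos, "-")
        result.append("".join(chars))
    return result


group = ["GA-TTC", "GAATTC"]          # already aligned pair, now fixed
widened = add_gap_columns(group, [3])
print(widened)                        # ['GA--TTC', 'GAA-TTC']
```

The original gap in the first sequence is still present after widening, and both sequences keep the same relative alignment.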
Summary
 There are three main stages for ClustalW
Thompson J.D., Higgins D.G., Gibson T.J. (1994). Nucleic Acids Res. 22:4673-4680.
 ClustalW spends around 96% of its running time in the first stage (pairwise alignment of the n sequences); the second and third stages account for the rest.
Perform ClustalW alignment
ClustalW
Main menu
Input file
Input File
 Prepare the input file
   sequences should all be in one file
   7 formats are accepted: NBRF/PIR, EMBL/SwissProt, Fasta, GDE, Clustal, GCG/MSF, RSF
 Edit the file with Notepad
   for example:
   Fasta is the common format
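For reference, a Fasta file holds one record per sequence: a header line starting with ">" followed by one or more lines of sequence. The sketch below parses such text into a dictionary. The record names and sequences are made up, and the parser is a minimal illustration rather than a complete Fasta reader.

```python
# Sketch: read Fasta-formatted text into {name: sequence}.

def parse_fasta(text: str) -> dict:
    """Collect sequence lines under the most recent '>' header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header is used as the name
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = """>seq1 hypothetical first sequence
GATTC
>seq2 hypothetical second sequence
GAAT
TC
"""
print(parse_fasta(example))  # {'seq1': 'GATTC', 'seq2': 'GAATTC'}
```

All sequences to be aligned go into the same file, one record after another.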
Main Menu
 1. Sequence input from disk
 2. Multiple alignment
 3. Profile / structure alignment
 4. Phylogenetic tree
Multiple alignment menu
 1. Do complete multiple alignment now (slow/fast)
 2. Produce guide tree only
 3. Do alignment using old guide tree file
 4. Slow / fast pairwise alignment
 5. Pairwise alignment parameters
 6. Multiple alignment parameters
 7. Reset gaps before alignment
 8. Screen display
 9. Output format options
Profile / structure alignment menu
 1. Input 1st profile
 2. Input 2nd profile / sequence
 3. Align 2nd profile to 1st profile
 4. Align sequences to 1st profile
Phylogenetic tree menu
 1. Input alignment
 2. Exclude positions with gaps
 3. Correct for multiple substitutions
 4. Draw tree now
 5. Bootstrap tree
Toggle slow/fast pairwise alignment
 Slow/accurate alignment
   Uses full dynamic programming.
   Fine for short sequences.
   With more than 100 sequences of length > 1000, it becomes extremely slow.
 Fast/approximate alignment
   Gains speed by considering:
   - only exactly matching fragments
   - only the best diagonals
Pairwise Alignment Parameters (1)
 Slow alignment:
. Gap Open Penalty: the penalty for opening a gap. (initial gap penalty)
. Gap Extension Penalty: the penalty for extending a gap by 1 residue.
   ACGTAAATTTTTGG
   ACGT------TTGG
   GOP is charged where the gap opens; GEP is charged for each residue by which the gap is extended.
. Protein Weight Matrix: Gonnet, BLOSUM, PAM
. DNA Weight Matrix: assigned to matches and mismatches
For example: Gonnet, BLOSUM, and PAM scoring matrices.
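The two gap penalties combine into a single cost per gap. The sketch below assumes the common affine convention cost = GOP + GEP x gap length; other conventions charge GEP only for residues after the first. The penalty values are arbitrary examples, not ClustalW defaults.

```python
# Sketch: affine gap cost, assuming cost = GOP + GEP * length.

def gap_cost(length: int, gop: float, gep: float) -> float:
    """Cost of one contiguous gap of the given length (0 for no gap)."""
    if length <= 0:
        return 0.0
    return gop + gep * length


# One gap of 6 residues versus six separate gaps of 1 residue, with example penalties.
print(gap_cost(6, gop=10.0, gep=0.5))      # 13.0
print(6 * gap_cost(1, gop=10.0, gep=0.5))  # 63.0
```

With a high opening penalty and a low extension penalty, one long gap is much cheaper than many short ones, which matches how insertions and deletions tend to occur.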
Pairwise Alignment Parameters  (2)
 Fast alignment
. K-Tuple Size: the size of the exactly matching fragments
   increase for speed (max = 2 for protein, 4 for DNA); decrease for sensitivity
. Top Diagonals: the number of best diagonals (those with the most K-tuple matches) that are used
   decrease for speed; increase for sensitivity
. Window Size: the number of diagonals around each of the best diagonals that are used
   decrease for speed; increase for sensitivity
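The idea behind the fast method can be sketched by counting exact K-tuple matches on each diagonal of an imaginary dot plot and keeping only the best diagonals. This is an illustration of the principle, not ClustalW's implementation; the function names and the convention diagonal = i - j are assumptions.

```python
# Sketch: count exact k-tuple matches per diagonal and pick the best diagonals.
from collections import Counter, defaultdict


def diagonal_hits(s: str, t: str, k: int) -> Counter:
    """Map diagonal (i - j) to the number of k-tuples of s at i that match t at j."""
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    hits = Counter()
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):
            hits[i - j] += 1
    return hits


def top_diagonals(s: str, t: str, k: int, top: int):
    """Return the `top` diagonals with the most k-tuple matches."""
    return diagonal_hits(s, t, k).most_common(top)


print(diagonal_hits("GATTC", "GAATTC", 2))     # Counter({-1: 3, 0: 1})
print(top_diagonals("GATTC", "GAATTC", 2, 1))  # [(-1, 3)]
```

A larger K gives fewer, more specific matches (faster, less sensitive), while keeping more diagonals or a wider window around them lets the alignment consider more candidates (slower, more sensitive).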
Multiple Alignment Parameter
. Gap Opening Penalty: increasing it makes gaps less frequent.
. Gap Extension Penalty: increasing it makes gaps shorter.
. Delay Divergent Sequences: delays the alignment of the most distantly related sequences until the most closely related sequences have been aligned.
. DNA Transition Weight: gives transitions (A<->G, C<->T) a weight between 0 and 1.
   0 means transitions are scored as mismatches; 1 gives them the match score.
   For distantly related DNA sequences, the weight should be approximately 0.
   For closely related DNA sequences, a higher weight can be used.
. Protein Weight Matrix: the matrix actually used depends on how similar the sequences to be aligned at this alignment step are.
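The effect of the DNA transition weight can be sketched with a simple scoring function. It assumes a match score of 1 and a mismatch score of 0 and places transitions between the two according to the weight; this is a simplified illustration, not ClustalW's scoring code.

```python
# Sketch: DNA column score with a transition weight between 0 and 1.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x: str, y: str, transition_weight: float,
              match: float = 1.0, mismatch: float = 0.0) -> float:
    """Score a pair of bases; transitions (A<->G, C<->T) get a weighted score."""
    if x == y:
        return match
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversion


print(dna_score("A", "G", 0.0))  # 0.0, transition treated as a mismatch
print(dna_score("A", "G", 1.0))  # 1.0, transition treated as a match
print(dna_score("C", "T", 0.5))  # 0.5
print(dna_score("A", "C", 1.0))  # 0.0, transversions are unaffected
```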
Output File
 CLUSTAL output: [filename].aln
 Guide tree: [filename].dnd
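The same run can also be started without the menus. The sketch below only builds a command line and prints it. It assumes the executable is called `clustalw` and uses the options -INFILE, -ALIGN, -OUTFILE, -GAPOPEN and -GAPEXT as I understand them from ClustalW's command-line help; check the help output of your installed version before relying on them. The file names are hypothetical.

```python
# Sketch: build a non-interactive ClustalW command line (nothing is executed here).

def clustalw_command(infile, outfile, gap_open=None, gap_ext=None, exe="clustalw"):
    """Return the argument list for a complete multiple alignment run."""
    cmd = [exe, f"-INFILE={infile}", "-ALIGN", f"-OUTFILE={outfile}"]
    if gap_open is not None:
        cmd.append(f"-GAPOPEN={gap_open}")
    if gap_ext is not None:
        cmd.append(f"-GAPEXT={gap_ext}")
    return cmd


cmd = clustalw_command("seqs.fasta", "seqs.aln", gap_open=10, gap_ext=0.2)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the program is installed and the input file exists; the alignment and guide tree files are then written as described above.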