Multiple Sequence Alignment
Homology: Definition
Homology: similarity that is the result of inheritance from a common ancestor
Paralogs - related genes within an organism
Orthologs - genes in other species
An alignment is a hypothesis of positional homology between bases or amino acids.
Manual Alignment
Manual alignment might be carried out because:
. The alignment is easy.
. There is some extraneous information.
. Automated alignment methods have encountered the local minimum problem.
. An automated alignment can be improved.
Pairwise Alignment
Aligning two sequences: GATTC and GAATTC
Scoring: match +1, mismatch 0, indel -1

GATTC-
GAATTC
Score = 2

GA-TTC
GAATTC
Score = 4 (column scores: 1 1 -1 1 1 1)
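The scoring example can be checked with a short sketch. It assumes the scheme stated on the slide (match +1, mismatch 0, indel -1) and uses a plain Needleman-Wunsch style dynamic programme to find the best achievable score; the function names are illustrative and are not taken from any alignment package.

```python
def alignment_score(row_a, row_b, match=1, mismatch=0, indel=-1):
    """Score two already-aligned rows column by column ('-' marks an indel)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def best_global_score(a, b, match=1, mismatch=0, indel=-1):
    """Best global alignment score by dynamic programming (linear gap cost)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


print(alignment_score("GATTC-", "GAATTC"))  # gap placed at the end
print(alignment_score("GA-TTC", "GAATTC"))  # gap placed inside
print(best_global_score("GATTC", "GAATTC"))
```

The second placement of the gap reaches the best score the dynamic programme can find for this pair, which is why it is the preferred alignment.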
Progressive Alignment
. Devised by Feng and Doolittle in 1987.
. A heuristic method and, as such, not guaranteed to find the optimal alignment.
. Based on pairwise alignment.
. The most successful implementation is Clustal (by Des Higgins).
ClustalW
ClustalW - Introduction
. General purpose: the comparison or alignment of DNA or protein sequences.
. Biologists can study the sequence patterns conserved through evolution and the ancestral relationships between different organisms.
. ClustalW runs on different operating systems, including WinXP, UNIX (Linux) and Macintosh.
. History: the first Clustal programme (1988, by Des Higgins), ClustalV (1992), ClustalW (1994), ClustalX.
Unrooted Neighbour-Joining tree Rooted NJ tree (guide tree) and sequence weights
Pairwise Alignment
. Each sequence is aligned pairwise with every other sequence. For example, with n sequences, $\binom{n}{2} = \frac{n(n-1)}{2}$ pairwise alignments are calculated.
. Accurate scores come from a full dynamic programming alignment using two gap penalties (opening and extension) and a full amino acid weight matrix.
. Each pairwise alignment is completely independent
The pairwise alignment scores are initially calculated as percent identity scores. They are converted to distances by dividing by 100 and subtracting from 1.0, which gives the number of differences per site: distance = 1.0 - (percent identity / 100).
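As a rough illustration of this stage, the sketch below enumerates the n(n-1)/2 sequence pairs and converts a percent identity into a distance. It assumes the two rows are already aligned and that columns containing a gap are left out of the identity count (an assumption, not stated on the slide); the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(row_a, row_b):
    """Percent identity over aligned columns; gapped columns are skipped (assumption)."""
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        identical += (x == y)
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid):
    """Divide by 100 and subtract from 1.0 to get differences per site."""
    return 1.0 - pid / 100.0


def sequence_pairs(names):
    """Every unordered pair of sequences: n(n-1)/2 of them."""
    return list(combinations(names, 2))


names = ["seq1", "seq2", "seq3", "seq4", "seq5"]  # placeholder names
print(len(sequence_pairs(names)))
print(identity_to_distance(percent_identity("GATTC", "GAATC")))
```

Because each pair is handled on its own, the loop over `sequence_pairs` could be distributed across processes without changing the result.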
Unrooted Neighbour-Joining tree
Generate a Neighbour-Joining guide tree from these pairwise distances. This guide tree gives the order in which the progressive alignment will be carried out.
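To make the tree-building step more concrete, here is a minimal sketch of a single Neighbour-Joining step: choosing which pair of sequences to join first. It uses the standard NJ criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the sum of a sequence's distances to all others. The distance values are made up for illustration, and this is not ClustalW's own code.

```python
from itertools import combinations


def closest_pair(names, dist):
    """Return the pair minimising the Neighbour-Joining Q criterion, and its Q value."""
    n = len(names)

    def d(a, b):
        return dist[frozenset((a, b))]

    # r(i): total distance from each sequence to all the others
    totals = {a: sum(d(a, b) for b in names if b != a) for a in names}

    best_pair, best_q = None, None
    for a, b in combinations(names, 2):
        q = (n - 2) * d(a, b) - totals[a] - totals[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q
    return best_pair, best_q


# Hypothetical distances between five sequences a..e
names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): value for pair, value in raw.items()}

print(closest_pair(names, dist))
```

Note that the pair with the smallest raw distance (d and e here) is not necessarily the pair joined first; the criterion also accounts for how far each sequence is from everything else. A full implementation would replace the joined pair with a new node, recompute distances, and repeat.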
Rooted NJ tree (guide tree) and sequence weights
The weights are dependent upon the distance from the root of the tree but sequences which have a common branch with other sequences share the weight derived from the shared branch.
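One way to read this weighting rule is sketched below: walking from the root to a sequence, each branch contributes its length divided by the number of sequences below it, so a shared branch is split equally among the sequences sharing it. The tree, its branch lengths, and the `(branch_length, payload)` representation are all hypothetical choices for illustration.

```python
def leaf_names(node):
    """All sequence names below a node; a node is (branch_length, name_or_children)."""
    _, payload = node
    if isinstance(payload, str):
        return [payload]
    return [name for child in payload for name in leaf_names(child)]


def sequence_weights(node, inherited=0.0, weights=None):
    """Accumulate, for each leaf, branch_length / (number of leaves sharing the branch)."""
    if weights is None:
        weights = {}
    length, payload = node
    share = inherited + length / len(leaf_names(node))
    if isinstance(payload, str):
        weights[payload] = share
    else:
        for child in payload:
            sequence_weights(child, share, weights)
    return weights


# Hypothetical rooted guide tree: C hangs off the root alone,
# A and B share an internal branch of length 0.2.
tree = (0.0, [(0.4, "C"), (0.2, [(0.1, "A"), (0.3, "B")])])
print(sequence_weights(tree))
```

In this toy tree A ends up with the lowest weight because it is close to the root and shares part of its path with B, which is the down-weighting of near-duplicate sequences that the rule is meant to achieve.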
Progressive Alignment
Align the two most closely related sequences first. This alignment is then fixed and will never change: once a gap, always a gap.
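The order of the progressive steps follows the guide tree from the tips towards the root. The sketch below only derives that order from a tree written as nested pairs; it does not perform the alignments themselves, and the tree shape and names are hypothetical.

```python
def merge_order(tree, steps=None):
    """Return (sequences under this node, list of merge steps in the order performed)."""
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return [tree], steps
    left, right = tree
    left_group, _ = merge_order(left, steps)
    right_group, _ = merge_order(right, steps)
    # Both sub-alignments are finished (and frozen) before they are aligned together
    steps.append((left_group, right_group))
    return left_group + right_group, steps


guide_tree = ((("A", "B"), ("C", "D")), "E")  # hypothetical guide tree
_, steps = merge_order(guide_tree)
for left_group, right_group in steps:
    print(left_group, "+", right_group)
```

Each step aligns two groups that are already fixed internally, so any gap introduced in an early step is carried unchanged into every later step.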
Summary
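There are three main stages in ClustalW: pairwise alignment of all the sequences, construction of the guide tree (with sequence weights), and progressive alignment.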
Thompson J.D., Higgins D.G., Gibson T.J. (1994). Nucleic Acids Res. 22:4673-4680.
ClustalW spends around 96% of its running time in the first stage, the pairwise alignment of the n sequences; the second and third stages account for the rest.
ClustalW
Main menu
Input file
Prepare the input file:
. The sequences should all be in one file.
. Seven formats are accepted: NBRF/PIR, EMBL/SwissProt, FASTA, GDE, Clustal, GCG/MSF, RSF.
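As an example of preparing such a file, the sketch below writes several sequences into a single FASTA file and reads them back. The helper names, the file name, and the example sequences are placeholders; only the basic FASTA layout (a `>` header line followed by sequence lines) is assumed.

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Write (name, sequence) pairs into one FASTA file, wrapping long sequences."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


def read_fasta(path):
    """Read a FASTA file back into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


records = [("seq1", "GATTC"), ("seq2", "GAATTC")]  # placeholder sequences
path = os.path.join(tempfile.mkdtemp(), "input.fasta")
write_fasta(records, path)
print(read_fasta(path))
```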
Main Menu
1. Sequence input from disk
2. Multiple alignment
3. Profile / structure alignment
4. Phylogenetic tree

Multiple alignment menu
1. Do complete multiple alignment now (slow/fast)
2. Produce guide tree only
3. Do alignment using old guide tree file
4. Slow / fast pairwise alignment
5. Pairwise alignment parameter
6. Multiple alignment parameter
7. Reset gaps before alignment
8. Screen display
9. Output format option

Profile / structure alignment
1. Input 1st. profile
2. Input 2nd. profile / sequence
3. Align 2nd. profile to 1st. profile
4. Align sequences to 1st. profile

Phylogenetic tree
1. Input alignment
2. Exclude position with gaps
3. Correct for multiple substitutions
4. Draw tree now
5. Bootstrap tree
. Gap Open Penalty (GOP): the penalty for opening a gap (the initial gap penalty).
. Gap Extension Penalty (GEP): the penalty for extending a gap by 1 residue.

ACGTAAATTTTTGG
ACGT------TTGG

The GOP is charged where the gap opens and the GEP as the gap is extended.
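The two penalties can be combined into a cost per gap. The sketch below assumes each gap costs GOP + GEP x (gap length); some conventions charge the extension penalty only for residues after the first, so treat the formula as one possible reading. The penalty values used are arbitrary and are not ClustalW defaults.

```python
import re


def gap_lengths(aligned_row):
    """Lengths of each run of '-' characters in an aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def gap_penalty(aligned_row, gop, gep):
    """Total gap cost, assuming GOP + GEP * length for every gap (assumption)."""
    return sum(gop + gep * length for length in gap_lengths(aligned_row))


row = "ACGT------TTGG"  # the gapped row from the example above
print(gap_lengths(row))
print(gap_penalty(row, gop=10.0, gep=0.5))  # illustrative penalty values
```

With a high GOP and a low GEP, one long gap is cheaper than several short ones, which tends to concentrate gaps into fewer blocks.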
. K-Tuple Size: increase for speed (max = 2 for protein, 4 for DNA); decrease for sensitivity.
. Top Diagonals: the number of K-tuple matches on each diagonal is counted, and the diagonals with the most matches are used.
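The idea behind the K-tuple and top-diagonal parameters can be sketched as follows: every exact K-tuple match between two sequences lies on a diagonal of an imaginary dot-matrix plot (here identified by the offset i - j), and the diagonals with the most matches are kept. This is an illustrative reimplementation with hypothetical function names, not ClustalW's own fast-alignment code.

```python
from collections import Counter, defaultdict


def diagonal_ktuple_counts(a, b, k):
    """Count exact k-tuple matches on each diagonal (offset = position in a - position in b)."""
    positions = defaultdict(list)
    for j in range(len(b) - k + 1):
        positions[b[j:j + k]].append(j)

    counts = Counter()
    for i in range(len(a) - k + 1):
        for j in positions.get(a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def top_diagonals(a, b, k, n_top):
    """The n_top diagonals with the most k-tuple matches."""
    return diagonal_ktuple_counts(a, b, k).most_common(n_top)


print(top_diagonals("GATTC", "GAATTC", k=2, n_top=1))
```

A larger k yields fewer, more specific matches (faster but less sensitive), which matches the guidance given for the K-Tuple Size parameter.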
Output File
CLUSTAL output : [filename].aln