Sequence Alignments (Similarity & Homology)
What is a sequence?
- Bioinformatics: a biological sequence is simply a word
- A word is an ordered collection of symbols from an alphabet
- Only the primary structure is taken into account
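The "word over an alphabet" view can be sketched in a few lines of Python. The alphabets below (4 nucleotides, 20 standard amino acids) and the function names are assumptions made for the illustration; ambiguity codes such as N or X are deliberately ignored.

```python
# Assumed alphabets: unambiguous DNA bases and the 20 standard amino acids.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_word_over(sequence, alphabet):
    """Return True if every symbol of the sequence belongs to the alphabet."""
    return all(symbol in alphabet for symbol in sequence.upper())


def guess_alphabet(sequence):
    """Name the first alphabet (DNA, then protein) that contains the sequence."""
    if is_word_over(sequence, DNA_ALPHABET):
        return "DNA"
    if is_word_over(sequence, PROTEIN_ALPHABET):
        return "protein"
    return "unknown"


print(guess_alphabet("ACTTAGGC"))    # DNA
print(guess_alphabet("SAAESNAILR"))  # protein
```

DNA is tested first because every DNA word is also a valid word over the protein alphabet.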
Bremen
Phylogeny
Genome Annotation
Text search
- Search for terms/criteria within sequence annotations: function, keywords, organisms, features
- Generic sites:
  - Entrez: NCBI server (http://www.ncbi.nlm.nih.gov/Entrez/)
  - SRS: available at different sites
Homology search
- Goal: search for sequence similarities in order to infer structural or functional information
- Method: sequence alignment or dot matrix
- Definitions:
  - Similarity: a quantity that is measured by comparing the sequences
  - Homology: a hypothesis based on sequence similarity; it states that 2 sequences are derived from a common ancestor
P. Thbault- Alignment sequence 7
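The dot matrix method named under "Homology search" can be sketched as follows: one sequence runs along the rows, the other along the columns, and a mark is placed wherever the two symbols are identical. This is a minimal version without the window and threshold filtering that practical dot-plot programs usually apply; the function name and the output symbols are assumptions.

```python
def dot_matrix(seq1, seq2, hit="*", miss="."):
    """Return one text row per symbol of seq1, marking identities with seq2."""
    return [
        "".join(hit if a == b else miss for b in seq2)
        for a in seq1
    ]


# Similar regions show up as diagonal runs of marks.
for row in dot_matrix("ACTTAGGC", "AGTGGC"):
    print(row)
```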
Evolution of sequences
- The concept of homology relates to the mechanisms of molecular evolution
- Principles:
  - homologous sequences are derived from a common ancestor whose sequence is not available (unfortunately!)
  - at the molecular level, the events of evolution are substitutions, insertions and deletions
  - there is a selection pressure at the structural or functional level on either genes or their products: this pressure guides sequence evolution
[Figure: orthologous and paralogous sequences. Labels: common ancestor, species 1, species 2, P2a, P2b, F2.]
[Figure: functional inference through homology. Labels: common ancestor P, speciation, time, P1 (species 1), P2 (species 2), database, software to compare sequences.]
[Figure: excerpt of the alignment of TRY3_ANOGA and TRY3_HUMAN]
TRY3_ANOGA SAAESNAILRAANIPTVNQKECTIAYSSSGGITDRMLCAGYKRGGKDACQGDSGGPLVV
TRY3_HUMAN SFGADYPDELKCLDAPVLREAECKA-SCPGKITNSMFCVGFLEGGKDSWKRDSGGPVVC

TRY3_ANOGA DGKLVGVVSWGFGCAMPGYPGVYARVAVVRNWVRENSGA--
Difficulties
- With time, mutations accumulate until the similarity between sequences disappears: the homology is no longer detectable = false negative
- Some mechanisms, independent of evolution, produce artifactual similarities (low-complexity regions): similarity but no homology = false positive
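One simple way to flag the low-complexity regions mentioned above is to measure the Shannon entropy of the composition in a sliding window. This is only a simplified stand-in for the dedicated filters used in practice; the window size, the threshold and the function names are assumptions for the illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy (in bits) of the symbol composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(sequence, window=4, threshold=1.0):
    """Return the start positions of windows whose entropy is below threshold."""
    return [
        start
        for start in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) < threshold
    ]


# The poly-A tail is flagged, the ACGT prefix is not.
print(low_complexity_windows("ACGTAAAAAAAA"))
```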
F12 & PLAT are 2 proteins involved in blood coagulation (the catalytic domain has a serine protease activity). Domains frequently correspond to exons.
Alignment of 2 sequences
- Problem: a huge number of possible alignments
- Which one to choose?

A C - T T A G G C
A - G T - - G G C
*     *     * * *

A C T T A G G C
- A G T - G G C
      *   * * *

A C T T A G G C
A G T - - G G C
*   *     * * *
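To give an idea of how huge the number of possibilities is, the sketch below counts the global alignments of two sequences of lengths m and n, assuming that each column is either a pair of symbols, a gap in the first sequence, or a gap in the second. Under that assumption the count follows the recurrence used in the code; the function name is an assumption, and alignments that differ only in the order of adjacent gaps are counted separately.

```python
def count_alignments(m, n):
    """Count the global alignments of two sequences of lengths m and n."""
    # counts[i][j] = number of alignments of a prefix of length i with a
    # prefix of length j; an empty prefix can only be aligned with gaps.
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column: symbol / symbol
                + counts[i - 1][j]    # last column: symbol / gap
                + counts[i][j - 1]    # last column: gap / symbol
            )
    return counts[m][n]


# ACTTAGGC (8 symbols) against AGTGGC (6 symbols)
print(count_alignments(8, 6))  # 40081
```

Even for these two very short sequences there are tens of thousands of candidates, which is why enumerating them is not an option for real sequences.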
Alignment of 2 sequences
- Evaluation
- Similarity criteria

4 matches, 2 mismatches, 2 gaps
A C T T A G G C
- A G T - G G C
      *   * * *

5 matches, 0 mismatches, 4 gaps
A C - T T A G G C
A - G T - - G G C
*     *     * * *

5 matches, 1 mismatch, 2 gaps
A C T T A G G C
A G T - - G G C
*   *     * * *
Score Systems
- Alignment score = sum of the scores at each position
- Different events: indel / substitution / identity
  - match = +2
  - mismatch = -1
  - gap = -2
- Not all substitutions are equivalent
  - DNA: transitions / transversions
  - Proteins:
    - physico-chemical properties
    - models of evolution
- Gap penalty:
  - linear, log
[Figure: the BLOSUM62 matrix]
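As a worked example of this score system, the sketch below counts the events of an alignment and sums the per-position scores with match = +2, mismatch = -1 and gap = -2. The three candidate alignments of ACTTAGGC and AGTGGC are written here as equal-length rows with "-" for a gap; that encoding and the function name are assumptions of the example.

```python
def score_alignment(row1, row2, match=2, mismatch=-1, gap=-2):
    """Return (matches, mismatches, gaps, score) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, score


print(score_alignment("ACTTAGGC", "-AGT-GGC"))    # (4, 2, 2, 2)
print(score_alignment("AC-TTAGGC", "A-GT--GGC"))  # (5, 0, 4, 2)
print(score_alignment("ACTTAGGC", "AGT--GGC"))    # (5, 1, 2, 5)
```

With these values the third alignment (5 matches, 1 mismatch, 2 gaps) gets the highest score; a different choice of penalties could rank the candidates differently.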
Algorithms
- How to find the best alignment?
- Exact algorithms:
  - Dynamic programming (Needleman & Wunsch, Smith & Waterman)
  - Slow when searching databases
- Heuristics = no guarantee of finding the optimal solution
  - Blast, Fasta
Global or local?
- 2 types of alignment:
  - global: over the whole length of the sequences (Needleman & Wunsch)
  - local: by pieces (Fasta)
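A global alignment by dynamic programming, in the spirit of Needleman & Wunsch, can be sketched as below. It assumes a linear gap penalty and the illustrative values match = +2, mismatch = -1, gap = -2; the function name and the tie-breaking order in the traceback are choices made for this example, so other optimal alignments with the same score may exist.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] faces a gap
                score[i][j - 1] + gap,       # b[j-1] faces a gap
            )
    # Traceback from the bottom-right corner to rebuild one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACTTAGGC", "AGTGGC")
print(best)
print(top)
print(bottom)
```

The table has (n + 1) x (m + 1) cells, so the cost grows with the product of the two lengths.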
For each sequence of the database, the program tries to find the best alignment
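The database scan described above can be sketched with an exact local alignment score in the spirit of Smith & Waterman: the query is compared with every sequence of the database and the entries are ranked by their best local score. The scoring values (+2 / -1 / -2, linear gaps), the toy database and the function names are assumptions; heuristic programs such as Blast and Fasta avoid filling the whole table for every entry.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new local alignment here
                previous[j - 1] + pair,   # extend with a pair of symbols
                previous[j] + gap,        # extend with a gap in b
                current[j - 1] + gap,     # extend with a gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


def rank_database(query, database):
    """Return (name, score) pairs sorted from best to worst local score."""
    scores = [(name, local_score(query, seq)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical two-entry database, for illustration only.
toy_database = {"hit": "GGGACGTCCC", "miss": "CCCCCCCC"}
print(rank_database("TTACGTAA", toy_database))
```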
$E = K \, m \, n \, e^{-\lambda S}$

- $S$: score of the alignment ("bit score" in its normalized form)
- $K$, $\lambda$: parameters (score system, sequence composition)
- $m$, $n$: lengths of the sequences (or size of the database)
- $E$ = number of alignments with a score of at least $S$ that are expected in the database under the random hypothesis
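A minimal numerical sketch of the formula follows. The values given to K and lambda are illustrative assumptions only (in practice they depend on the score system and the sequence composition), and the function names are hypothetical. The second function uses the usual normalization of a raw score into bits, which makes E equal to m * n * 2^(-bits).

```python
import math


def expected_hits(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


# Illustrative values: a 300-residue query against 10 million residues.
K, lam = 0.041, 0.267
for raw in (40, 80, 120):
    e_value = expected_hits(raw, m=300, n=10_000_000, K=K, lam=lam)
    print(raw, round(bit_score(raw, K, lam), 1), f"{e_value:.2e}")
```

E decreases exponentially with the score and grows linearly with the size of the search space, so the same alignment is less significant in a larger database.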
Blast Tools
Using Blast
- Questions:
  - Which database?
    - General (GenBank, UniProt)
    - Specialized (EST, limited to one organism, family of proteins, etc.)
  - Nucleic or protein sequences?
  - Are the default parameters suitable?
- Interpretation of the results:
  - Which maximum E-value? No simple rule:
    - E < 1e-10 => clear homology
    - 1e-10 < E < 1e-5 => maybe?
    - E > 1e-5 => not significant enough
Blast programs
Program   Query seq.    Database      Comment
blastp    proteins      proteins
blastn    nucleotides   nucleotides
blastx    nucleotides   proteins      Translation of the query seq.
tblastn   proteins      nucleotides   Translation of the database
tblastx   nucleotides   nucleotides   Translation of the query seq. and the database
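The choice of program can be written as a small lookup keyed on the type of the query, the type of the database, and whether a translated comparison is wanted. The function and the `translate` flag are hypothetical helpers for this illustration, not part of any Blast distribution.

```python
# (query type, database type, translated comparison?) -> program name
BLAST_PROGRAMS = {
    ("proteins", "proteins", False): "blastp",
    ("nucleotides", "nucleotides", False): "blastn",
    ("nucleotides", "proteins", True): "blastx",
    ("proteins", "nucleotides", True): "tblastn",
    ("nucleotides", "nucleotides", True): "tblastx",
}


def choose_blast_program(query_type, database_type, translate=False):
    """Pick the Blast program for a query/database pair."""
    if query_type != database_type:
        # Mixed types can only be compared after translating the
        # nucleotide side into protein.
        translate = True
    key = (query_type, database_type, translate)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no Blast program for {key}")
    return BLAST_PROGRAMS[key]


print(choose_blast_program("nucleotides", "proteins"))                     # blastx
print(choose_blast_program("nucleotides", "nucleotides", translate=True))  # tblastx
```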