
Sequences Alignments (Similarity & Homology)

The document discusses sequence alignment and homology. It defines a biological sequence as an ordered collection of symbols in an alphabet, and notes that sequence analysis can provide information about a sequence's function and relationships to other molecules. Sequence analysis methods include pairwise alignment, multiple alignment, motif searches, phylogeny, and homology searches using tools like BLAST to compare a query sequence to database sequences and evaluate hits.

Uploaded by

monkey_isaac
Copyright
© Attribution Non-Commercial (BY-NC)

Sequences alignments (similarity & homology)

What is a sequence?
- Bioinformatics: a biological sequence is simply a word
- A word is an ordered collection of symbols from an alphabet
- Only the primary structure is taken into account
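The "word over an alphabet" definition can be checked mechanically. The sketch below is only an illustration: the two alphabets (unambiguous DNA letters and the 20 standard amino acids) and the function name `is_word_over` are assumptions, not part of the slides.

```python
# Alphabets assumed for illustration: unambiguous letters only.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_word_over(sequence: str, alphabet: set) -> bool:
    """Return True if every symbol of the sequence belongs to the alphabet."""
    return all(symbol in alphabet for symbol in sequence.upper())


# A short DNA word and the first residues of TRY3_CHICK from the slides.
print(is_word_over("ACTTAGGC", DNA_ALPHABET))
print(is_word_over("MKFLFLILSC", PROTEIN_ALPHABET))
print(is_word_over("MKFLFLILSC", DNA_ALPHABET))
```

Only the order and identity of the symbols are used here, which matches the statement that only the primary structure is taken into account.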

Bremen

P. Thébault - Alignment sequence

The sequence is represented through a given format


Sequence analysis, what for?

- A sequence contains information about:
  - function,
  - relationships with other molecules
- A sequence reflects physico-chemical constraints due to:
  - the environment (water, lipid)
  - molecular evolution

The objective is to predict important information about the function of a macromolecule from its sequence alone.

Sequence analysis, what for?


[Diagram linking a sequence and the databases to the analyses: homologous sequence search, pairwise alignment, multiple alignment, motif search, phylogeny, genome annotation]

Text search
- Search for terms/criteria within sequence annotations: function, keywords, organisms, features
- Generic sites:
  - Entrez: NCBI server (http://www.ncbi.nlm.nih.gov/Entrez/)
  - SRS: available at different sites
- Specialized sites:
  - SGD (all about yeast: http://genome-www.stanford.edu/Saccharomyces/)

Homology search
- Goal: search for sequence similarities in order to infer structural or functional information
- Method: sequence alignment or dot matrix
- Definitions:
  - Similarity: a measure of how alike two sequences are
  - Homology:
    - is a hypothesis based on sequence similarity
    - states that 2 sequences are derived from a common ancestor

Evolution of sequences
- The concept of homology relates to the mechanisms of molecular evolution
- Principles:
  - homologous sequences are derived from a common ancestor whose sequence is not available (unfortunately!)
  - at the molecular level, the events of evolution are substitutions, insertions and deletions
  - there is a selection pressure at the structural or functional level on genes or their products: this pressure guides sequence evolution

Information inference and evolution

- Most bioinformatics methods rely on transferring information from known sequences to new sequences: reasoning by inference
- This inference relies on evolutionary events such as:
  - speciation (one ancestral species => several species)
  - gene duplication
  - merging / splitting of genes (leads to the domain composition of genes and proteins)

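[Diagram: along a time axis, a common ancestor gives rise through speciation and duplication to proteins P1, P2a and P2b in species 1 and species 2, with functions F and F2; the diagram labels the orthologous and paralogous relationships]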

functional inference
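[Diagram: P1 (species 1) and P2 (species 2) derive from a common ancestor P through speciation. P2 is in the database with function F established by experimental work. What is the function of P1? Software to compare sequences detects the homology between P1 and P2, and function F is deduced for P1 from this homology.]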

Ex. 1: trypsin, human & chicken (~80 % id.)

TRY3_CHICK MKFLFLILSCLGAAVAFPGGADDDKIVGGYTCPEHSVPYQVSLNSGYHFCGGSLINSQWV
TRY3_HUMAN MN-PFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWV
           *: ****: :*****.* *********** *:*:********* ********..***

TRY3_CHICK LSAAHCYKSRIQVRLGEYNIDVQEDSEVVRSSSVIIRHPKYSSITLNNDIMLIKLASAVE
TRY3_HUMAN VSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAV
           :*******:********:**.* *..* . .:: *******. **:********:*..

TRY3_CHICK YSADIQPIALPSSCAKAGTECLISGWGNTLSNGYNYPELLQCLNAPILSDQECQEAYPGD
TRY3_HUMAN INARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLREAECKASCPGK
           .* :..*:**:: . *************** * :**: *:**:**:* : **: : **.

TRY3_CHICK ITSNMICVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGIGCALKGYPGVYTKVCNYVDW
TRY3_HUMAN ITNSMFCVGFLEGGKDSWKRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDW
           **..*:*********** **********:***:**** *** *. ******* *****

Ex. 2: trypsin, human & mosquito (~30 % id.)


TRY3_ANOGA MISNKIAILLAVLVVAVACAQARVALKHRSVQALPRFLPRPQYDVGHRIVGGFEIDVSET
TRY3_HUMAN --------MNPFLILAFVGAA--V--------AVP------FDDDDKIVGGYTCEENSL
           : ..*::*.. * * *:* :* ..:****:

TRY3_ANOGA PYQVSLQYFNSHRCGGSVLNSKWILTAAHCTVNLQPSSLAVRLGSS--RHASGGTVVRV
TRY3_HUMAN PYQVSLN-SGSHFCGGSLISEQWVVSAAHC---YKTRIQVRLGEHNIKVLEGNEQFINA
           ******: .** ****::..:*:::**** : : ****. : .*. .:..

TRY3_ANOGA ARVLEHPNYDDSTIDYDFSLMELETELTFSDVVQPVSLPEQDEAVEDGTMTTVSGWGNTQ
TRY3_HUMAN AKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAA-GTECLISGWGNTL
           *:::.**:*: .*:* *: *::*.: .:. *..:*** *. ** : .. :******

TRY3_ANOGA SAAESNAILRAANIPTVNQKECTIAYSSSGGITDRMLCAGYKRGGKDACQGDSGGPLVV
TRY3_HUMAN SFGADYPDELKCLDAPVLREAECKA-SCPGKITNSMFCVGFLEGGKDSWKRDSGGPVVC
           * .*: *:. : *.:.: **. *..* **: *:*.*: .****: : *****:*

TRY3_ANOGA DGKLVGVVSWGFGCAMPGYPGVYARVAVVRNWVRENSGA--

The trypsin case

- Highly conserved sequence
- Strong structural constraints: 3 cysteine bridges (Cys-Cys)
- Sequence similarity is consistent with the phylogenetic distances between species
- Functional identity has been demonstrated experimentally
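The identity percentages quoted in the two trypsin examples (~80 % and ~30 % id.) come from counting identical columns in an alignment. The sketch below shows one way to do that count. It assumes identity is divided by the number of alignment columns (other conventions divide by the length of the shorter sequence), and the function name `percent_identity` is hypothetical. The two rows are the last block of Ex. 1, so the result describes that block only, not the full proteins.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of alignment columns that hold the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Ignore columns where both rows hold a gap; they carry no information.
    columns = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * identical / len(columns)


# Last block of Ex. 1 (chicken and human trypsin), copied from the alignment.
CHICK = "ITSNMICVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGIGCALKGYPGVYTKVCNYVDW"
HUMAN = "ITNSMFCVGFLEGGKDSWKRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDW"

print(f"{percent_identity(CHICK, HUMAN):.1f} % identity over {len(CHICK)} columns")
```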

Difficulties
- Over time, mutations accumulate until the similarity between sequences disappears: homology is no longer detectable = false negatives
- Some mechanisms, independent of evolution, produce artifactual similarities (low-complexity regions): similarity but no homology = false positives

Modular composition of proteins

- Many proteins appear as combinations of domains
- Domains can be repeated and can occur in different proteins in various orders
- Similarity (and homology!) between proteins can thus be partial: this makes alignment more complicated and affects functional inference (a common domain might not be enough to imply a common function)

Example of modular proteins


F12:  F2 - E - F1 - E - K - catalytic
PLAT: F1 - E - K - K - catalytic

F12 & PLAT are 2 proteins involved in blood coagulation (the catalytic domain has a serine protease activity). Domains frequently correspond to exons.


How to compare 2 sequences?

- Based on a graphical view -> dot matrix approach
- Based on a sequence view -> alignment approach

"Dot matrix" view

[Figure: dot matrix comparing Protein 1 with Protein 2]
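A dot matrix puts one sequence on each axis and marks every cell where the two sequences agree, so similar regions show up as diagonals. The sketch below is a minimal version under stated assumptions: the `window` and `min_matches` filtering is a common way to reduce noise but is not described on the slides, and the names `dot_matrix` and `render` are hypothetical.

```python
def dot_matrix(seq1: str, seq2: str, window: int = 1, min_matches: int = 1):
    """Return a list of rows (one per position of seq2) of booleans.

    Cell [i][j] is True when at least `min_matches` of the `window` symbols
    starting at seq2[i] and seq1[j] are identical.
    """
    rows = []
    for i in range(len(seq2) - window + 1):
        row = []
        for j in range(len(seq1) - window + 1):
            matches = sum(seq1[j + k] == seq2[i + k] for k in range(window))
            row.append(matches >= min_matches)
        rows.append(row)
    return rows


def render(matrix, seq1: str, seq2: str) -> str:
    """Draw the matrix as text: seq1 across the top, seq2 down the side."""
    if not matrix:
        return ""
    lines = ["  " + seq1[: len(matrix[0])]]
    for i, row in enumerate(matrix):
        lines.append(seq2[i] + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Two short DNA words; with window=1 every shared letter produces a dot.
print(render(dot_matrix("ACTTAGGC", "AGTGGC"), "ACTTAGGC", "AGTGGC"))
```

Raising `window` and `min_matches` keeps only runs of agreement, which makes the diagonals easier to see on longer sequences.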

Alignment of 2 sequences
- Problem: a huge number of possible alignments
- Which one?

A C - T T A G G C
A - G T - - G G C
*     *     * * *

A C T T A G G C
- A G T - G G C
      *   * * *

A C T T A G G C
A - G T - G G C
*     *   * * *

Alignment of 2 sequences
- Evaluation
- Similarity criteria

4 matches, 2 mismatches, 2 gaps:
A C T T A G G C
- A G T - G G C
      *   * * *

5 matches, 0 mismatches, 4 gaps:
A C - T T A G G C
A - G T - - G G C
*     *     * * *

5 matches, 1 mismatch, 2 gaps:
A C T T A G G C
A - G T - G G C
*     *   * * *
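The match, mismatch and gap counts used to compare candidate alignments can be obtained column by column. The sketch below is an illustration only: the function name `count_events` is hypothetical, and the gapped rows are one reading of the three candidate alignments of ACTTAGGC and AGTGGC that is consistent with the counts stated on the slide.

```python
def count_events(row_a: str, row_b: str):
    """Return (matches, mismatches, gaps) for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gaps += 1          # a column with a gap in either row
        elif a == b:
            matches += 1       # identical symbols
        else:
            mismatches += 1    # two different symbols
    return matches, mismatches, gaps


# Assumed reconstruction of the three candidate alignments.
candidates = [
    ("ACTTAGGC", "-AGT-GGC"),
    ("AC-TTAGGC", "A-GT--GGC"),
    ("ACTTAGGC", "A-GT-GGC"),
]
for top, bottom in candidates:
    print(top, bottom, count_events(top, bottom))
```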

Score Systems
- Alignment score = sum of the scores at each position
- Different events: indel / substitution / identity
  - match = +2
  - mismatch = -1
  - gap = -2
- Not all substitutions are equivalent:
  - DNA: transitions / transversions
  - Proteins: physico-chemical properties, models of evolution -> substitution matrix
- Gap penalty: linear, log; gap opening / gap extension
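The scoring rule (sum of per-position scores, with match = +2, mismatch = -1, gap = -2) can be written as a short function. This is a sketch under assumptions: the name `alignment_score` is hypothetical, the separate `gap_open` / `gap_extend` values in the second call are arbitrary illustration values, and the three alignments are one reading of the slide's candidates for ACTTAGGC and AGTGGC.

```python
def alignment_score(row_a: str, row_b: str, match: int = 2, mismatch: int = -1,
                    gap_open: int = -2, gap_extend: int = -2) -> int:
    """Sum the per-column scores of two aligned rows.

    With gap_open == gap_extend the gap penalty is linear (the slide's -2 per
    gap). Different values give an opening/extension (affine) penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # Continuing a gap in the same row costs the extension penalty.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


for top, bottom in [("ACTTAGGC", "-AGT-GGC"),
                    ("AC-TTAGGC", "A-GT--GGC"),
                    ("ACTTAGGC", "A-GT-GGC")]:
    linear = alignment_score(top, bottom)
    affine = alignment_score(top, bottom, gap_open=-3, gap_extend=-1)
    print(top, bottom, "linear:", linear, "opening/extension:", affine)
```

Under the linear scheme the alignment with 5 matches, 1 mismatch and 2 gaps scores highest of the three, which is one way to answer "Which one?".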

Matrix of substitutions (aa)

BLOSUM62

Matrix of substitutions (aa)


BLOSUM62
- matches: always > 0, but with different scores
- mismatches:
  - < 0: penalty
  - = 0: neutral
  - > 0: nearly like a match
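The three mismatch categories can be illustrated with a handful of BLOSUM62 entries. The sketch below hard-codes a small excerpt of the matrix rather than loading the full table, so it only answers for the pairs listed. The names `BLOSUM62_EXCERPT`, `blosum_score` and `describe` are hypothetical.

```python
# A small excerpt of BLOSUM62 (symmetric, so each pair is stored once).
BLOSUM62_EXCERPT = {
    ("W", "W"): 11, ("C", "C"): 9, ("A", "A"): 4, ("L", "L"): 4,  # matches
    ("I", "V"): 3, ("I", "L"): 2, ("D", "E"): 2, ("K", "R"): 2,   # conservative
    ("A", "T"): 0, ("A", "G"): 0,                                  # neutral
    ("A", "W"): -3, ("L", "D"): -4,                                # penalized
}


def blosum_score(a: str, b: str) -> int:
    """Look up a pair in either order; raises KeyError outside the excerpt."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def describe(a: str, b: str) -> str:
    """Classify a pair following the categories listed for BLOSUM62."""
    score = blosum_score(a, b)
    if a == b:
        return "match"
    if score < 0:
        return "penalty"
    if score == 0:
        return "neutral"
    return "nearly like a match"


for pair in [("W", "W"), ("A", "A"), ("L", "I"), ("A", "T"), ("D", "L")]:
    print(pair, blosum_score(*pair), describe(*pair))
```

The two matches W/W and A/A show that identical residues do not all get the same score.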

Algorithms
- How to find the best alignment?
- Exact algorithms:
  - Dynamic programming (Needleman & Wunsch, Smith & Waterman)
  - Time-consuming when searching databases
- Heuristics = no guarantee of finding the optimal solution
  - Blast, Fasta

Global or local?
- 2 types of alignment:
  - global: over the total length (Needleman & Wunsch, Fasta)
  - local: by pieces (Smith & Waterman, Blast)
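The two exact dynamic programming methods can be sketched side by side. This is a minimal illustration, not a reference implementation: it assumes the simple scores match = +2, mismatch = -1 and a linear gap penalty of -2, the function names are hypothetical, and the local version returns only the best score (no traceback).

```python
def pair_score(a: str, b: str, match: int = 2, mismatch: int = -1) -> int:
    return match if a == b else mismatch


def needleman_wunsch(seq1: str, seq2: str, gap: int = -2):
    """Global alignment: returns (score, aligned_row1, aligned_row2)."""
    n, m = len(seq1), len(seq2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap          # seq1 prefix aligned to gaps only
    for j in range(1, m + 1):
        table[0][j] = j * gap          # seq2 prefix aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + pair_score(seq1[i - 1], seq2[j - 1]),
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == (
                table[i - 1][j - 1] + pair_score(seq1[i - 1], seq2[j - 1])):
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return table[n][m], "".join(reversed(row1)), "".join(reversed(row2))


def smith_waterman_score(seq1: str, seq2: str, gap: int = -2) -> int:
    """Local alignment: best score of any pair of subsequences (never below 0)."""
    best = 0
    previous = [0] * (len(seq2) + 1)
    for i in range(1, len(seq1) + 1):
        current = [0] * (len(seq2) + 1)
        for j in range(1, len(seq2) + 1):
            current[j] = max(
                0,                                   # start a new local alignment
                previous[j - 1] + pair_score(seq1[i - 1], seq2[j - 1]),
                previous[j] + gap,
                current[j - 1] + gap,
            )
            best = max(best, current[j])
        previous = current
    return best


score, top, bottom = needleman_wunsch("ACTTAGGC", "AGTGGC")
print("global:", score)
print(top)
print(bottom)
print("local :", smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

Both functions fill a table of size len(seq1) x len(seq2), which is why exact methods become slow when a query is compared with every sequence of a database.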

Comparing a sequence with those of a database

The goal is to compare a query sequence with all the subject sequences of the database.

[Diagram: a query sequence compared against the database]

For each sequence of the database, the program tries to find the best alignment.

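Blast Hit evaluation

- Statistical evaluation: could the hit be due to chance?
- E-value:

$$E = K \, m \, n \, e^{-\lambda S}$$

- $S$: score of the alignment (this form uses the raw score; with the bit score $S'$ the same quantity is $E = m \, n \, 2^{-S'}$)
- $K$, $\lambda$: parameters (scoring system, sequence composition)
- $m$, $n$: lengths of the sequences (or size of the database)

E = number of alignments with a score of at least S that would be expected in the database under the random hypothesis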
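The E-value formula is easy to evaluate numerically. The sketch below assumes the raw-score form E = K·m·n·e^(−λS); the function name `expected_hits` is hypothetical, and the values of K, λ, m and n in the example are arbitrary illustration values, not parameters of any particular Blast run.

```python
import math


def expected_hits(score: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lam * score): alignments expected by chance."""
    return K * m * n * math.exp(-lam * score)


# Hypothetical parameters chosen only to show the behaviour of the formula.
K, lam = 0.1, 0.3
query_length, database_size = 250, 1_000_000
for s in (20, 40, 60, 80):
    e = expected_hits(s, query_length, database_size, K, lam)
    print(f"S = {s:3d}  ->  E = {e:.3g}")
```

E falls exponentially as the score grows and rises linearly with the size of the database, so the same score is less significant in a larger database.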

Blast Tools

Using Blast
- Questions:
  - Which database?
    - General (GenBank, UniProt)
    - Specialized (EST, limited to one organism, family of proteins, etc.)
  - Nucleic or protein sequences?
  - Are the default parameters suitable?
- Interpretation of the results:
  - Which maximum E-value? No simple rule:
    - E < 1e-10 => clear homology
    - 1e-10 < E < 1e-5 => maybe ???
    - 1e-5 < E => not significant enough
  - But also: size, % identity, % gaps => examine the alignments
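The rule-of-thumb E-value thresholds can be written as a small helper for triaging hits. This is only a sketch: the function name `interpret_evalue` is hypothetical, the handling of values exactly equal to a threshold is an assumption, and the thresholds are the indicative ones listed above ("no simple rule").

```python
def interpret_evalue(e_value: float) -> str:
    """Map an E-value to the indicative categories (thresholds 1e-10, 1e-5)."""
    if e_value < 1e-10:
        return "clear homology"
    if e_value < 1e-5:
        return "maybe"
    # Assumption: values equal to a threshold fall in the less confident class.
    return "not significant enough"


for e in (3e-42, 2e-7, 0.5):
    print(f"E = {e:g}: {interpret_evalue(e)}")
```

Such a filter is only a first pass; alignment length, percent identity and gaps still need to be examined for each hit.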

Blast programs
Program   Query seq.    Database      Comment
blastp    proteins      proteins
blastn    nucleotides   nucleotides
blastx    nucleotides   proteins      Translation of the query seq.
tblastn   proteins      nucleotides   Translation of the database
tblastx   nucleotides   nucleotides   Translation of the query seq. and the database
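The choice of Blast program follows directly from the type of the query and of the database. The sketch below encodes that choice as a lookup. The function name `choose_blast_program` and the `translated` flag (used to tell blastn from tblastx, which both take nucleotide query and database) are assumptions made for illustration.

```python
def choose_blast_program(query: str, database: str, translated: bool = False) -> str:
    """Pick a Blast program from the sequence types ('protein' or 'nucleotide').

    `translated` only matters for nucleotide vs nucleotide: False compares the
    nucleotides directly (blastn), True translates both sides (tblastx).
    """
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",    # query is translated
        ("protein", "nucleotide"): "tblastn",   # database is translated
    }
    if (query, database) == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    try:
        return table[(query, database)]
    except KeyError:
        raise ValueError(f"unknown sequence types: {query!r}, {database!r}")


print(choose_blast_program("protein", "protein"))
print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translated=True))
```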
